In recent years, object trackers built on deep networks or complex architectures have made great progress in accuracy, but less attention has been paid to tracking speed. On the resource-limited edge platforms carried by UAVs, such trackers struggle to meet real-time processing requirements. To address these problems, we propose a UAV object tracking method based on a vision transformer. First, the lightweight neural network ResNet18 is used for feature extraction. Second, a transformer encoder enhances the fused template and search features to improve feature representation, and the enhanced features output by the encoder are used for target prediction. Finally, a further weighted fusion at the decision level produces the final tracking result, and a multi-supervision strategy is adopted. The proposed method achieves strong results on three challenging large-scale tracking datasets, UAV123, LaSOT, and TrackingNet, reaching success scores (AUC) of 66.82, 64.26, and 81.2, respectively. In addition, we measured the speed of our model on the UAV123 test set: it runs at 116.2 fps on an NVIDIA GTX 1660 Ti and at 20.2 fps on an NVIDIA Jetson Xavier NX, which largely meets the real-time requirements of UAV applications.
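For readers unfamiliar with the success score (AUC) reported above, the following is a minimal sketch of how such a score is commonly computed in tracking benchmarks: the per-frame overlap (IoU) between predicted and ground-truth boxes is thresholded, and the resulting success rates are averaged over the thresholds. The function names `iou` and `success_auc`, the `(x, y, w, h)` box format, the 21 evenly spaced thresholds, and the strict `>` comparison are assumptions made for illustration; they are not taken from this paper, and each benchmark toolkit may differ in detail.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x, y, w, h)."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    # Width and height of the overlapping rectangle (zero if disjoint).
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def success_auc(pred_boxes, gt_boxes, num_thresholds=21):
    """Area under the success curve, scaled to 0-100.

    Assumption: thresholds are evenly spaced in [0, 1] and a frame
    counts as a success when its IoU is strictly above the threshold.
    """
    if not pred_boxes or len(pred_boxes) != len(gt_boxes):
        raise ValueError("need two non-empty box lists of equal length")
    overlaps = [iou(p, g) for p, g in zip(pred_boxes, gt_boxes)]
    thresholds = [i / (num_thresholds - 1) for i in range(num_thresholds)]
    # Success rate at each threshold: fraction of frames above it.
    rates = [sum(o > t for o in overlaps) / len(overlaps) for t in thresholds]
    # Averaging the rates approximates the area under the success curve.
    return 100.0 * sum(rates) / len(rates)


if __name__ == "__main__":
    # Hypothetical boxes, purely for illustration.
    predictions = [(0, 0, 10, 10), (5, 0, 10, 10)]
    ground_truth = [(0, 0, 10, 10), (0, 0, 10, 10)]
    print(f"success AUC: {success_auc(predictions, ground_truth):.2f}")
```

Under these assumptions, a tracker whose boxes overlap the ground truth more tightly on more frames obtains a higher score, which is why the metric is reported alongside speed when judging suitability for UAV deployment.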