In this paper, we propose a novel approach for real-time human action recognition (HAR) on resource-constrained UAVs. Our approach addresses the limited availability of labeled UAV video data (relative to ground-based datasets) by incorporating synthetic data augmentation to improve the performance of a lightweight action recognition model, yielding a robust and efficient solution for UAV-based HAR. We evaluate our method on the RoCoG-v2 and UAV-Human datasets, showing notable top-1 accuracy gains across all RoCoG-v2 scenarios: 9.1% when training on synthetic data only, 6.9% on real data only, and 11.8%, the largest gain, when combining both. Additionally, an X3D backbone further improves accuracy on the UAV-Human dataset by 5.5%. Deployed on a Qualcomm Robotics RB5 platform, our models achieve real-time predictions at approximately 10 frames per second (fps) and offer a superior trade-off between accuracy and inference rate on both low-power edge devices and high-end desktops.
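To make the combined real-plus-synthetic training strategy concrete, the sketch below fine-tunes a lightweight X3D backbone (loaded from PyTorchVideo's public hub) on a mixture of real and synthetic clip datasets. This is a minimal illustration under stated assumptions, not the paper's exact configuration: the `RandomClips` placeholder dataset, `NUM_CLASSES = 7`, the `x3d_s` variant, and all hyperparameters are illustrative stand-ins.

```python
import torch
from torch.utils.data import ConcatDataset, DataLoader, Dataset


class RandomClips(Dataset):
    """Hypothetical stand-in for a real or synthetic UAV clip dataset."""

    def __init__(self, n, num_classes):
        self.n, self.num_classes = n, num_classes

    def __len__(self):
        return self.n

    def __getitem__(self, idx):
        # x3d_s expects clips shaped (C=3, T=13, H=160, W=160).
        clip = torch.randn(3, 13, 160, 160)
        label = torch.randint(self.num_classes, (1,)).item()
        return clip, label


NUM_CLASSES = 7  # hypothetical; set to the dataset's gesture vocabulary

# Lightweight X3D backbone pretrained on Kinetics, via PyTorchVideo's hub.
model = torch.hub.load("facebookresearch/pytorchvideo", "x3d_s", pretrained=True)
# Replace the final classification projection to match the target label set.
model.blocks[-1].proj = torch.nn.Linear(
    model.blocks[-1].proj.in_features, NUM_CLASSES
)

# Combined real + synthetic training: the setting with the largest
# reported top-1 gain on RoCoG-v2.
real, synth = RandomClips(64, NUM_CLASSES), RandomClips(64, NUM_CLASSES)
loader = DataLoader(ConcatDataset([real, synth]), batch_size=8, shuffle=True)

optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
criterion = torch.nn.CrossEntropyLoss()

model.train()
for clips, labels in loader:
    optimizer.zero_grad()
    loss = criterion(model(clips), labels)
    loss.backward()
    optimizer.step()
```

In this setup, `ConcatDataset` simply pools the two sources so each batch can draw from both; swapping the placeholder datasets for real RoCoG-v2 loaders and compressing the fine-tuned model for on-device inference would be the remaining steps toward the RB5 deployment described above.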