Recognizing human action in videos at low resolution is of crucial importance for security monitoring and privacy protection. The previous methods usually carry out in a two-stage manner combined with super-resolution. The super-resolution module is trained with paired high-resolution (HR) and low-resolution (LR) action videos in the first step, then the output videos are used to train the action classification module subsequently. However, the practice guides super-resolution enhancement for promoting video visual quality rather than action recognition accuracy. Moreover, only low-resolution videos are usually available in the real world. We propose an end-to-end framework OSRO for low-resolution video human action recognition, which is trained in a one-stage pipeline with only low-resolution videos. Specifically, the super-resolution enhancement module and action classification module are cascaded to form a single-stream model, jointly optimized via our newly designed comprehensive loss LOSRO . Extensive experiments demonstrated that our model OSRO achieves excellent performance, which obtains an accuracy of 80.62% on the low-resolution UCF101 dataset, which surpasses the previous best method (Prog. DVSR) by 10.07%.
|