Recent studies have shown that deep neural networks are sensitive to visual blur. This vulnerability raises concerns about the robustness of learning-based methods in safety-critical applications, such as autonomous driving and industrial inspection. Although many methods have demonstrated the potential to enhance the robustness of vision tasks in blurry scenes, a standard benchmark for comparing and validating their effectiveness is still lacking. To address this, we propose a new benchmark for the comprehensive evaluation of visual recognition tasks in blurry scenes. The benchmark dataset includes a large number of scenes with different blur types, enabling multi-dimensional evaluation and comparison of existing methods. Existing methods are integrated into a unified evaluation framework for comparison, and their robustness-improvement mechanisms are broadly categorized as reducing either image-level shift or distribution-level shift. Through this unified framework, we identify the strengths and weaknesses of current methods in visual recognition tasks under blurry conditions. Furthermore, we propose a prior knowledge-based regularization module that can be easily incorporated into existing methods and consistently boosts performance on visual recognition tasks in blurry scenes.
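The abstract does not specify how the regularization module reduces distribution-level shift; the following is a minimal sketch, assuming a consistency term that aligns backbone features of a sharp image and its blurred counterpart. The names `backbone`, `sharp_batch`, and `blurred_batch` are hypothetical placeholders, not the benchmark's actual API.

```python
import torch
import torch.nn.functional as F

def blur_consistency_loss(backbone: torch.nn.Module,
                          sharp_batch: torch.Tensor,
                          blurred_batch: torch.Tensor,
                          weight: float = 0.1) -> torch.Tensor:
    """Illustrative regularizer: penalize the feature shift between sharp
    and blurred views of the same scenes (an assumed mechanism)."""
    feat_sharp = backbone(sharp_batch)    # features of the clean images
    feat_blur = backbone(blurred_batch)   # features of the blurred images
    # Compare normalized feature vectors; the sharp branch is detached so the
    # blurred branch is pulled toward the clean-image feature distribution.
    feat_sharp = F.normalize(feat_sharp.flatten(1), dim=1)
    feat_blur = F.normalize(feat_blur.flatten(1), dim=1)
    return weight * F.mse_loss(feat_blur, feat_sharp.detach())
```

In such a setup the term would simply be added to the task loss, so it can be bolted onto existing recognition pipelines without architectural changes.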
A training scheme called region-refocusing (RR) is proposed to improve the accuracy and accelerate the convergence of compact one-stage detection neural networks. The main contributions are as follows: (1) the RR mask is proposed to incorporate the position information and significance of objects, so that the regions containing objects can be learned selectively by the compact student detector, leading to more reasonable feature representations; (2) within the RR training framework, selected objectness features from the large teacher detector are used to enrich the supervision signal and enhance the loss functions for training the student detector, which contributes to rapid convergence and accurate detection; (3) by virtue of the RR scheme, the mean average precision (mAP) of the compact detector can be significantly improved even when the model is initialized from scratch. The superiority of RR has been verified on several benchmark datasets in comparison with other training schemes: the mAP of the well-known tiny-YOLOv2 improves from 57.4% to 63.8% (a gain of 6.4 points) on the VOC2007 test set when the weights are pretrained on ImageNet. Remarkably, when pretraining is omitted, RR yields a boost of 22.6 mAP points over the plain training scheme, demonstrating its robustness and efficiency. Moreover, a compact one-stage detector trained with this framework is well suited for deployment on resource-constrained devices, offering competitive precision with lower computational requirements.
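A minimal sketch of how an RR-style mask could gate a teacher-student mimicking loss, assuming the mask marks feature-grid cells covered by ground-truth boxes and the student imitates the teacher only inside those regions. The functions `rr_mask_from_boxes` and `rr_distillation_loss` are illustrative names, not the paper's exact formulation.

```python
import torch

def rr_mask_from_boxes(boxes: torch.Tensor, feat_h: int, feat_w: int,
                       stride: int) -> torch.Tensor:
    """Binary mask on the feature grid marking cells covered by ground-truth
    boxes given as (x1, y1, x2, y2) in image pixels."""
    mask = torch.zeros(feat_h, feat_w)
    for x1, y1, x2, y2 in boxes.tolist():
        mask[int(y1) // stride:int(y2) // stride + 1,
             int(x1) // stride:int(x2) // stride + 1] = 1.0
    return mask

def rr_distillation_loss(student_feat: torch.Tensor,
                         teacher_feat: torch.Tensor,
                         mask: torch.Tensor) -> torch.Tensor:
    """Mimic the teacher's features only inside the object regions selected
    by the RR mask, so the compact student focuses on informative areas."""
    mask = mask.to(student_feat.device)[None, None]        # (1, 1, H, W)
    diff = (student_feat - teacher_feat.detach()) ** 2     # (B, C, H, W)
    return (diff * mask).sum() / mask.sum().clamp(min=1.0)
```

This masked term would be added to the detector's standard detection loss during student training; the teacher is frozen and only supplies the target features.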
As convolutional neural networks have demonstrated state-of-the-art performance in object recognition and detection, there is a growing need to deploy these systems on resource-constrained mobile platforms. However, the computational burden and energy consumption of inference for these networks are significantly higher than what most low-power devices can afford. To address these limitations, this paper proposes a method to train object detection networks with low-precision weights and activations. The probability density functions of the weights and activations of each layer are first estimated directly using piecewise Gaussian models. Then, the optimal quantization intervals and step sizes for each convolution layer are adaptively determined according to the distribution of weights and activations. Because the most computationally expensive convolutions can be replaced by efficient fixed-point operations, the proposed method drastically reduces computational complexity and memory footprint. Applied to the tiny you only look once (YOLO) and full YOLO architectures, the proposed method achieves accuracy comparable to their 32-bit counterparts. As an illustration, the proposed 4-bit and 8-bit quantized versions of the YOLO model achieve a mean average precision (mAP) of 62.6% and 63.9%, respectively, on the Pascal visual object classes 2012 test dataset, compared with 64.0% for the 32-bit full-precision baseline.
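A minimal sketch of distribution-aware uniform quantization, assuming the clipping range is derived from a Gaussian fit to a layer's weights; the single-Gaussian fit and the 3-sigma clipping rule here are illustrative simplifications of the paper's piecewise-Gaussian procedure.

```python
import numpy as np

def quantize_weights(weights: np.ndarray, num_bits: int = 4,
                     num_sigmas: float = 3.0):
    """Quantize weights to signed fixed point with a step size chosen from
    the estimated spread of the weight distribution."""
    mu, sigma = weights.mean(), weights.std()
    clip = num_sigmas * sigma                  # clipping range from the fit
    levels = 2 ** (num_bits - 1) - 1           # e.g. 7 symmetric levels for 4 bits
    step = max(clip / levels, 1e-12)           # quantization step size
    q = np.clip(np.round((weights - mu) / step), -levels, levels)
    return q.astype(np.int8), step, mu         # integer codes + dequant params

# Dequantization for reference: w_hat = q * step + mu
```

Once weights and activations are expressed as integer codes with a shared step size, the inner products of a convolution reduce to fixed-point multiply-accumulates, which is where the reported savings in computation and memory come from.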