KEYWORDS: Data modeling, Object detection, Transformers, Education and training, Performance modeling, 3D modeling, Sensors, Visual process modeling, Linear filtering, Computer vision technology
Collecting and annotating real-world data for the development of object detection models is a time-consuming and expensive process. In the military domain in particular, data collection can also be dangerous or infeasible. Training models on synthetic data may provide a solution for cases where access to real-world training data is restricted. However, bridging the reality gap between synthetic and real data remains a challenge. Existing methods usually build on top of baseline Convolutional Neural Network (CNN) models that have been shown to perform well when trained on real data, but that transfer poorly when trained on synthetic data. For example, some architectures are designed for fine-tuning on large quantities of training data and are prone to overfitting on synthetic data. Related work also tends to ignore best practices from object detection on real data, e.g. by training on synthetic data from a single environment with relatively little variation. In this paper we propose a methodology for improving the performance of a pre-trained object detector when training on synthetic data. Our approach focuses on extracting the salient information from synthetic data without forgetting useful features learned from pre-training on real images. Building on the state of the art, we incorporate data augmentation methods and a Transformer backbone. Besides reaching relatively strong performance without any specialized synthetic-data transfer methods, we show that our methods improve the state of the art in object detection trained on synthetic data for the RarePlanes and DGTA-VisDrone datasets, and reach near-perfect performance on an in-house vehicle detection dataset.
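The abstract does not detail the augmentation pipeline; the following is a minimal, hypothetical sketch of the kind of photometric and geometric augmentation commonly used to narrow the synthetic-to-real gap, written with the albumentations library. All transform choices and parameters are illustrative assumptions, not the paper's actual configuration.

```python
# Hypothetical augmentation pipeline for synthetic detection data.
# All transforms and parameters are illustrative; the paper's exact
# augmentation recipe is not specified in the abstract.
import albumentations as A

def build_augmentations() -> A.Compose:
    return A.Compose(
        [
            A.HorizontalFlip(p=0.5),
            # Strong photometric jitter counters the overly uniform
            # colors and lighting of rendered imagery.
            A.RandomBrightnessContrast(brightness_limit=0.3,
                                       contrast_limit=0.3, p=0.7),
            A.HueSaturationValue(p=0.5),
            A.GaussNoise(p=0.3),          # mimic real sensor noise
            A.MotionBlur(blur_limit=5, p=0.2),
        ],
        # Keep bounding boxes consistent with the transformed image.
        bbox_params=A.BboxParams(format="pascal_voc",
                                 label_fields=["labels"]),
    )

# Usage: out = build_augmentations()(image=img, bboxes=boxes, labels=labels)
```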
Combining data from multiple sensors to improve the overall robustness and reliability of a classification system has become crucial in many applications, from military surveillance and decision support to autonomous driving, robotics, and medical imaging. This so-called sensor fusion is especially interesting for fine-grained target classification, in which very specific sub-categories (e.g. ship types) need to be distinguished, a task that can be challenging with data from a single modality. Typical modalities are electro-optical (EO) image sensors, which can provide rich visual details of an object of interest, and radar, which can yield additional spatial information. Several fusion techniques exist, defined by the approach used to combine data from these sensors. For example, late fusion can merge the class probabilities output by separate processing pipelines dedicated to each of the individual sensors. In particular, deep learning (DL) has been widely leveraged for EO image analysis, but typically requires a lot of data to adapt to the nuances of a fine-grained classification task. Recent advances in DL foundation models have shown high potential for dealing with in-domain data scarcity, especially in combination with few-shot learning. This paper presents a framework to effectively combine EO and radar sensor data, and shows how this method outperforms stand-alone single-sensor methods for fine-grained target classification. We adopt a strong few-shot image classification baseline based on foundation models, which robustly handles the lack of in-domain data and exploits rich visual features. In addition, we investigate a weighted and a Bayesian fusion approach to combine the target class probabilities output by the image classification model with radar kinematic features. Experiments with data acquired in a measurement campaign at the port of Rotterdam show that our fusion method improves on the classification performance of the individual modalities.
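As a rough illustration of the two late-fusion rules mentioned (the paper's exact formulation may differ), the sketch below combines per-class probability vectors from the EO and radar pipelines with a convex weighting and with a naive-Bayes product rule; the weight w and the uniform class prior are assumptions.

```python
import numpy as np

def weighted_fusion(p_eo: np.ndarray, p_radar: np.ndarray,
                    w: float = 0.6) -> np.ndarray:
    """Convex combination of per-class probabilities from both sensors."""
    return w * p_eo + (1.0 - w) * p_radar

def bayesian_fusion(p_eo: np.ndarray, p_radar: np.ndarray,
                    prior: np.ndarray = None) -> np.ndarray:
    """Product rule assuming conditionally independent sensors:
    p(c|eo,radar) is proportional to p(c|eo) * p(c|radar) / p(c)."""
    k = p_eo.shape[0]
    if prior is None:
        prior = np.full(k, 1.0 / k)  # assumed uniform class prior
    unnorm = p_eo * p_radar / prior
    return unnorm / unnorm.sum()
```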
Automated object detection is becoming more relevant in a wide variety of applications in the military domain. This includes the detection of drones, ships, and vehicles in visual and infrared (IR) video. In recent years, deep learning-based object detection methods, such as YOLO, have shown promise in many object detection applications. However, current methods have limited success when objects of interest cover only a small number of pixels, e.g. objects that are far away or small objects close by. This is important, since accurate small object detection translates to early detection, and the earlier an object is detected, the more time is available for action. In this study, we investigate novel image analysis techniques that are designed to address some of the challenges of (very) small object detection by taking temporal information into account. We implement six methods, of which three are based on deep learning and use the temporal context of a set of frames within a video. These methods consider neighboring frames when detecting objects, either by stacking them as additional channels or by considering difference maps. We compare these spatio-temporal deep learning methods with YOLO-v8, which only considers single frames, and with two traditional moving object detection methods. Evaluation is done on a set of videos that encompasses a wide variety of challenges, including various objects, scenes, and acquisition conditions, to show real-world performance.
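To make the two temporal-context strategies concrete, here is a minimal sketch (not the paper's implementation) of stacking neighboring frames as extra channels and of computing a frame-difference map in NumPy; the window size k is an assumption.

```python
import numpy as np

def stack_frames(frames: list, t: int, k: int = 1) -> np.ndarray:
    """Concatenate frames t-k .. t+k along the channel axis, yielding an
    (H, W, C*(2k+1)) input; indices are clamped at the video borders."""
    last = len(frames) - 1
    window = [frames[min(max(t + dt, 0), last)] for dt in range(-k, k + 1)]
    return np.concatenate(window, axis=-1)

def difference_map(frames: list, t: int, k: int = 1) -> np.ndarray:
    """Absolute grayscale difference between frame t and frame t-k;
    even few-pixel moving objects tend to stand out in this map."""
    def gray(f):
        return f.mean(axis=-1) if f.ndim == 3 else f
    a = gray(frames[t]).astype(np.int16)
    b = gray(frames[max(t - k, 0)]).astype(np.int16)
    return np.abs(a - b).astype(np.uint8)
```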
Electro-optical (EO) sensors are essential for surveillance in military and security applications. Recent technological advancements, especially developments in Deep Learning (DL), have enabled improved object detection and tracking in complex and dynamic environments. Most of this research focuses on readily available visible light (VIS) images. To apply these technologies to thermal infrared (TIR) imagery, DL networks can be retrained using image data in the TIR domain. However, a training set with enough TIR samples is not easily available. This paper presents an unsupervised domain adaptation method for ship detection in TIR imagery using paired VIS and TIR images. The proposed method leverages the pairing of VIS and TIR images and performs domain adaptation using detections in the VIS imagery as ground truth for learning in the TIR domain. The method performs ship detection in the VIS images using a pretrained convolutional neural network (CNN), and these detections are subsequently improved using a tracking algorithm. The proposed TIR object detection model follows a two-stage training process. In the first stage, the model's head is trained, which consists of the regression layers that output the bounding boxes of the detected objects. In the second stage, the model's feature extractor is trained to learn more discriminative features. The method is evaluated on a dataset of recordings at Rotterdam harbor. Experiments demonstrate that the resulting TIR detector performs comparably to its VIS counterpart, in addition to providing reliable detections in adverse environmental conditions where the VIS model fails. The proposed method has significant potential for real-world applications, including maritime surveillance.
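The abstract outlines a two-stage schedule (head first, then the feature extractor). A minimal PyTorch sketch of that freeze/unfreeze pattern is shown below, with a torchvision Faster R-CNN standing in for the paper's unspecified detector; the optimizer and learning rates are assumptions.

```python
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn

# Stand-in detector; the paper's actual CNN architecture is not specified.
model = fasterrcnn_resnet50_fpn(weights="DEFAULT")

def set_backbone_trainable(trainable: bool) -> None:
    for p in model.backbone.parameters():
        p.requires_grad = trainable

# Stage 1: train only the head (the layers regressing bounding boxes),
# using VIS detections transferred to the paired TIR frames as labels.
set_backbone_trainable(False)
head_params = [p for p in model.parameters() if p.requires_grad]
stage1_opt = torch.optim.SGD(head_params, lr=5e-3, momentum=0.9)

# Stage 2: unfreeze the feature extractor so it can learn more
# discriminative TIR-specific features, at a lower learning rate.
set_backbone_trainable(True)
stage2_opt = torch.optim.SGD(model.parameters(), lr=5e-4, momentum=0.9)
```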
Deep learning has emerged as a powerful tool for image analysis in various fields, including the military domain. It has the potential to automate and enhance tasks such as object detection, classification, and tracking. Training images for the development of such models are typically scarce, due to the restricted nature of this type of data. Consequently, researchers have focused on using synthetic data for model development, since simulated images are fast to generate and can, in theory, make up a large and diverse data set. When using simulated training data, it is important to consider the variety needed to bridge the gap between simulated and real data. So far, it is not fully understood which variations are important and how much variation is needed. In this study, we investigate the effect of simulation variety. We do so for the development of a deep learning-based military vehicle detector that is evaluated on real-world images of military vehicles. To construct the synthetic training data, 3D models of the vehicles are placed in front of diverse background scenes. We experiment with the number of images, background scene variations, 3D model variations, model textures, camera-object distance, and various object rotations. The insights that we gain can be used to prioritize future efforts towards creating synthetic data for deep learning-based object detection models.
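As an illustration of the compositing step described above (3D-model renders placed in front of varied backgrounds), the following hypothetical PIL sketch randomizes position, scale, and in-plane rotation; the parameter ranges are assumptions, not the study's settings.

```python
import random
from PIL import Image

def composite(render: Image.Image, background: Image.Image) -> Image.Image:
    """Paste an RGBA vehicle render onto a background scene at a random
    position, scale, and in-plane rotation, returning one training image."""
    bg = background.convert("RGB").copy()
    scale = random.uniform(0.1, 0.5)   # proxy for camera-object distance
    w = int(bg.width * scale)
    h = int(render.height * w / render.width)
    fg = render.resize((w, h)).rotate(random.uniform(-10, 10), expand=True)
    x = random.randint(0, max(bg.width - fg.width, 0))
    y = random.randint(0, max(bg.height - fg.height, 0))
    bg.paste(fg, (x, y), fg)           # alpha channel keeps the silhouette
    return bg
```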
It is important to combat fraud on travel documents (e.g., passports), identity documents (e.g., ID cards) and breeder documents (e.g., birth certificates) to facilitate the travel of bona fide travelers and to prevent criminal cross-border activities, such as terrorism, illegal migration, smuggling, and human trafficking. However, it is challenging and time-consuming to verify all document security features manually. New technologies can assist in automated fraud detection in these documents, which may result in faster and more consistent checks. This paper presents and evaluates four new technologies for automated document analysis. The first recognizes printing techniques. The second assists in recognizing fraudulent details. The third extracts information from the document, which can be used to detect anomalies at a tactical level. The fourth analyzes travel patterns, using information from the visa pages in passports. The performance of each element is assessed with quantitative performance metrics.