KEYWORDS: Object detection, Visual process modeling, Education and training, X-rays, X-ray imaging, Visualization, Systems modeling, Data modeling, Information visualization, Image enhancement
Recent Multimodal Large Language Models (MLLMs) have demonstrated remarkable capabilities in vision-language tasks such as image captioning and question answering. However, they lack an essential perception ability: object detection. In this work, we focus on detecting prohibited items and explore the possibility of integrating multimodal LLMs into the detection process. Our method first performs image captioning on the X-ray image of the prohibited item, then constructs instructions that prompt the multimodal LLM to identify the prohibited item. This approach leverages the contextual understanding and language-processing strengths of MLLMs. While current real-time object detection methods achieve high accuracy, they often require extensive training on large datasets specific to the prohibited items. In contrast, MLLMs can understand and generate detailed descriptions, which is advantageous when prohibited items are poorly represented in training data or vary widely in appearance. Our results suggest that MLLMs can complement traditional methods by providing a more nuanced understanding of prohibited items through their ability to interpret and respond to complex queries, potentially improving detection rates in challenging environments.
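The abstract does not specify the captioning model or the MLLM interface, so the following is a minimal sketch of the caption-then-prompt pipeline it describes, assuming an OpenAI-compatible vision chat endpoint. The model name, the prohibited-item label list, and the prompt wording are illustrative assumptions, not the paper's actual configuration.

```python
import base64
from openai import OpenAI  # assumption: any OpenAI-compatible MLLM endpoint

client = OpenAI()

# Illustrative label set; the paper does not publish its exact list.
PROHIBITED = ["knife", "gun", "scissors", "lighter", "battery"]

def detect_prohibited_items(image_path: str) -> str:
    """Two-step pipeline: caption the X-ray scan, then prompt the MLLM
    to identify any prohibited item the caption suggests."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()

    # Step 1: image captioning on the X-ray prohibited-item image.
    caption = client.chat.completions.create(
        model="gpt-4o-mini",  # assumption: any vision-capable chat model
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Describe every object visible in this X-ray scan."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
    ).choices[0].message.content

    # Step 2: build an instruction from the caption and query the MLLM.
    prompt = (
        f"X-ray caption: {caption}\n"
        f"Prohibited items of interest: {', '.join(PROHIBITED)}.\n"
        "Which prohibited items, if any, does the caption indicate? Answer briefly."
    )
    answer = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    ).choices[0].message.content
    return answer
```

Separating captioning from identification, as the abstract describes, lets the second prompt carry explicit task context, which is where the MLLM's language-processing strength is exercised.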
The illicit trade of prohibited products poses significant health and economic challenges globally, prompting the need for more effective detection methods in express-shipment inspections. This study introduces a novel multi-dimensional approach to prohibited-package detection that combines X-ray scanning imagery with simulated express-shipment information, leveraging the capabilities of both computer vision and large language models (LLMs). We employ a state-of-the-art feature extraction model, specifically a You Only Look Once (YOLO) variant, to analyse self-constructed X-ray image datasets for contraband indicators. Concurrently, we generate a simulated express-information dataset that encapsulates patterns characteristic of smuggling tactics. These data are then integrated with image-derived features through custom-designed prompts to train LLM classifiers. Our methodology is unique in its multi-modal data fusion, hypothesizing that the synthesis of visual and textual information will markedly improve detection accuracy over traditional single-dimension analysis. The experimental results demonstrate the efficacy of this approach, with the LLM classifiers outperforming standard methods in accurately identifying prohibited packages. The study not only provides a new perspective on the application of LLMs to contraband detection but also sets a precedent for future research in multimodal data exploitation for security and customs enforcement.
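The abstract outlines a fusion of YOLO detections with express-shipment records via prompts to an LLM classifier, but names neither the YOLO variant nor the LLM. Below is a minimal sketch under stated assumptions: an off-the-shelf YOLOv8 detector from the ultralytics package stands in for the paper's variant, an OpenAI-compatible chat model stands in for the LLM classifier, and the shipment-record fields (sender, route, declared) are hypothetical placeholders for the simulated express information.

```python
from ultralytics import YOLO  # assumption: YOLOv8 stands in for the paper's YOLO variant
from openai import OpenAI

detector = YOLO("yolov8n.pt")  # placeholder weights; the paper trains on self-built X-ray data
client = OpenAI()

def classify_package(xray_path: str, shipment: dict) -> str:
    """Fuse X-ray detections with shipment metadata in a single prompt."""
    # Image branch: run the detector and summarize detections as text features.
    result = detector(xray_path)[0]
    detections = [
        f"{result.names[int(box.cls)]} (conf {float(box.conf):.2f})"
        for box in result.boxes
    ]

    # Text branch: fold the simulated express-shipment record into the prompt.
    prompt = (
        f"X-ray detections: {', '.join(detections) or 'none'}\n"
        f"Shipment record: sender={shipment['sender']}, route={shipment['route']}, "
        f"declared contents={shipment['declared']}\n"
        "Classify this package as PROHIBITED or CLEAN and give a one-line reason."
    )
    return client.chat.completions.create(
        model="gpt-4o-mini",  # assumption: any instruction-following LLM classifier
        messages=[{"role": "user", "content": prompt}],
    ).choices[0].message.content

# Example call with a hypothetical shipment record:
# classify_package("scan_0042.png",
#                  {"sender": "anonymous", "route": "A->B", "declared": "toys"})
```

The design choice the abstract hypothesizes is visible here: neither branch alone sees the whole picture, so the prompt is where the visual and textual evidence are synthesized before classification.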