Video understanding with image, audio, and text
8 November 2024
Weiwei Wen, Lingzhi Liao
Proceedings Volume 13416, Fourth International Conference on Advanced Algorithms and Neural Networks (AANN 2024); 134162Q (2024) https://doi.org/10.1117/12.3049519
Event: 2024 4th International Conference on Advanced Algorithms and Neural Networks, 2024, Qingdao, China
Abstract
With the explosive growth of streaming video data, describing and understanding video has become a topic of interest in the international academic community. However, existing methods overlook the complementary information carried by the image, audio, and text of a video, which leads to an incomplete understanding of its content. In this paper, we propose a novel video understanding algorithm that incorporates this neglected information. First, the method combines speech recognition with a Large Language Model (LLM) to obtain detailed textual descriptions of the video. Second, the images and the textual descriptions are combined to obtain video keyframes. Finally, the textual descriptions and keyframes are concatenated to produce the video understanding results. Extensive experiments show the superiority of the proposed method.
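The three stages described in the abstract can be sketched as a small pipeline. The following is only an illustrative outline under stated assumptions, not the authors' implementation: the abstract does not name the speech recognizer, the LLM, or the keyframe criterion, so all three are passed in as hypothetical callables (`transcribe`, `describe`, `score_frame`), frames are represented by placeholder caption strings, and keyframes are chosen by a simple word-overlap score that stands in for whatever image-text matching the paper uses.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class VideoUnderstanding:
    description: str             # text from speech recognition + LLM (stage 1)
    keyframe_indices: List[int]  # indices of the selected frames (stage 2)
    summary: str                 # description concatenated with keyframes (stage 3)


def understand_video(
    frames: Sequence[str],
    audio: str,
    transcribe: Callable[[str], str],          # hypothetical ASR stand-in
    describe: Callable[[str], str],            # hypothetical LLM stand-in
    score_frame: Callable[[str, str], float],  # hypothetical frame/text relevance
    top_k: int = 2,
) -> VideoUnderstanding:
    # Stage 1: speech recognition, then an LLM turns the transcript
    # into a textual description of the video.
    description = describe(transcribe(audio))

    # Stage 2: rank frames by relevance to the description and keep
    # the top_k as keyframes, restored to temporal order.
    ranked = sorted(
        range(len(frames)),
        key=lambda i: score_frame(frames[i], description),
        reverse=True,
    )
    keyframe_indices = sorted(ranked[:top_k])

    # Stage 3: concatenate the description with the selected keyframes.
    keyframe_text = ", ".join(frames[i] for i in keyframe_indices)
    summary = f"{description} | keyframes: {keyframe_text}"
    return VideoUnderstanding(description, keyframe_indices, summary)


def word_overlap(frame_caption: str, description: str) -> float:
    # Toy relevance score (an assumption): fraction of the frame
    # caption's distinct words that also appear in the description.
    caption_words = set(frame_caption.lower().split())
    description_words = set(description.lower().split())
    return len(caption_words & description_words) / max(len(caption_words), 1)


# Toy usage: the "audio" is already text and the "LLM" only adds a prefix.
frames = ["a dog runs on grass", "blank title card", "a dog catches a ball"]
result = understand_video(
    frames,
    audio="the dog catches the ball",
    transcribe=lambda a: a,
    describe=lambda t: "Description: " + t,
    score_frame=word_overlap,
)
print(result.keyframe_indices)
print(result.summary)
```

Passing the three components as callables keeps the sketch independent of any particular speech recognizer or LLM; real models would replace the lambdas, and `score_frame` would typically compare image and text embeddings rather than caption words.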
(2024) Published by SPIE. Downloading of the abstract is permitted for personal use only.
Weiwei Wen and Lingzhi Liao "Video understanding with image, audio, and text", Proc. SPIE 13416, Fourth International Conference on Advanced Algorithms and Neural Networks (AANN 2024), 134162Q (8 November 2024); https://doi.org/10.1117/12.3049519
KEYWORDS
Video
Speech recognition
Feature extraction
Image processing
Video processing