Multimodal VAD: Visual Anomaly Detection in Intelligent Monitoring System via Audio-Vision-Language
Multimodal VAD: Visual Anomaly Detection in Intelligent Monitoring System via Audio-Vision-Language
Dicong Wang,Qilong Wang,Qinghua Hu,Kaijun Wu
TLDR
A novel dual-stream multimodal VAD network, which integrates coarse-grained and fine-grained streams combining video, audio, and text modalities is proposed, which supports the development of highly robust intelligent monitoring systems and promotes the potential applications of multimodal VAD across industrial monitoring, public safety, smart cities, and so on.
Abstract
The deep learning-based anomaly detection methods using visual sensors generally rely on a single modality or variants as raw signal inputs, which severely limits expressiveness and adaptability. The evolution of multimodal and visual-language pretrained models is shaping new possibilities in video anomaly detection (VAD). So, how to efficiently leverage them to achieve reliable multimodal VAD presents a significant challenge worth investigating. In this work, we propose a novel dual-stream multimodal VAD network, which integrates coarse-grained and fine-grained streams combining video, audio, and text modalities. First, in the coarse-grained stream, we perform cross-modal fusion of audio features with temporally modeled visual features, utilizing contrastive optimization to achieve more accurate coarse-grained results. In the fine-grained stream, we constructed abnormal-aware context prompts (ACPs) by integrating visual information and prior knowledge related to anomalous events into the text modality. Through the “coarse-support-fine” strategy, we further enhanced the model’s ability to discriminate fine-grained anomalies. Our method achieved optimal performance in experiments on two large-scale anomaly datasets, demonstrating its effectiveness and superiority. It supports the development of highly robust intelligent monitoring systems and promotes the potential applications of multimodal VAD across industrial monitoring, public safety, smart cities, and so on.
