Multimodal VAD: Visual Anomaly Detection in Intelligent Monitoring System via Audio-Vision-Language

Dicong Wang, Qilong Wang, Qinghua Hu, Kaijun Wu

2025 · DOI: 10.1109/TIM.2025.3578702
IEEE Transactions on Instrumentation and Measurement

TLDR

The paper proposes a novel dual-stream multimodal VAD network that integrates coarse-grained and fine-grained streams combining video, audio, and text modalities. The approach supports the development of highly robust intelligent monitoring systems and promotes potential applications of multimodal VAD in industrial monitoring, public safety, smart cities, and related areas.

Abstract

Deep learning-based anomaly detection methods that use visual sensors generally rely on a single modality, or variants of it, as raw signal input, which severely limits their expressiveness and adaptability. The evolution of multimodal and vision-language pretrained models is opening new possibilities for video anomaly detection (VAD). How to leverage these models efficiently to achieve reliable multimodal VAD remains a significant challenge worth investigating. In this work, we propose a novel dual-stream multimodal VAD network that integrates coarse-grained and fine-grained streams and combines the video, audio, and text modalities. In the coarse-grained stream, we perform cross-modal fusion of audio features with temporally modeled visual features and use contrastive optimization to obtain more accurate coarse-grained results. In the fine-grained stream, we construct abnormal-aware context prompts (ACPs) by integrating visual information and prior knowledge about anomalous events into the text modality. Through a “coarse-support-fine” strategy, we further enhance the model’s ability to discriminate fine-grained anomalies. Our method achieves the best performance in experiments on two large-scale anomaly datasets, demonstrating its effectiveness and superiority. It supports the development of highly robust intelligent monitoring systems and promotes potential applications of multimodal VAD in industrial monitoring, public safety, smart cities, and related areas.
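
The dual-stream idea described in the abstract can be illustrated with a toy sketch. The abstract does not specify the architecture, so everything below is an assumption made for illustration: the function names (`cross_modal_fuse`, `coarse_scores`, `fine_grained_probs`), the use of unprojected scaled dot-product attention for audio-visual fusion, the linear scoring weights, and in particular the way the coarse score is used to support the fine-grained prediction (the coarse anomaly probability is split across abnormal classes by similarity to hypothetical prompt embeddings). Training, including the contrastive optimization, is omitted.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def cross_modal_fuse(visual, audio):
    """Each visual snippet attends over all audio snippets; the attended
    audio vector is added back to the visual feature (residual fusion).
    Assumption: no learned projections, both modalities share one dimension."""
    scale = math.sqrt(len(visual[0]))
    fused = []
    for v in visual:
        weights = softmax([dot(v, a) / scale for a in audio])
        attended = [sum(w * a[d] for w, a in zip(weights, audio))
                    for d in range(len(v))]
        fused.append([vi + ai for vi, ai in zip(v, attended)])
    return fused

def coarse_scores(fused, w, b=0.0):
    """Coarse-grained stream: one anomaly probability per snippet
    (sigmoid of a linear score; w and b are hypothetical parameters)."""
    return [1.0 / (1.0 + math.exp(-(dot(f, w) + b))) for f in fused]

def fine_grained_probs(fused, prompts, coarse, temperature=0.1):
    """Fine-grained stream: index 0 is 'normal', the rest are abnormal classes.
    Assumed 'coarse-support-fine' rule: the coarse anomaly probability s is
    distributed over abnormal classes by prompt similarity."""
    rows = []
    for f, s in zip(fused, coarse):
        per_class = softmax([cosine(f, p) / temperature for p in prompts])
        rows.append([1.0 - s] + [s * p for p in per_class])
    return rows

# Toy inputs: 3 visual snippets, 2 audio snippets, 2 abnormal-class prompts.
visual = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
audio = [[0.2, 0.1], [0.9, 0.3]]
prompts = [[1.0, 0.0], [0.0, 1.0]]

fused = cross_modal_fuse(visual, audio)
scores = coarse_scores(fused, w=[0.5, -0.5])
probs = fine_grained_probs(fused, prompts, scores)
```

Each row of `probs` is a distribution over one normal class and the abnormal classes, so a low coarse score suppresses every abnormal class at once. This is only one plausible reading of the strategy, not the authors' formulation.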