InternVideo2: Scaling Video Foundation Models for Multimodal Video Understanding
Yi Wang, Kunchang Li, +15 authors, Limin Wang
TLDR
This work scales both data and model size for InternVideo2, a model that outperforms others on various video-related captioning, dialogue, and long video understanding benchmarks, highlighting its ability to reason over and comprehend long temporal contexts.
Abstract
At the data level, we prioritize spatiotemporal consistency by semantically segmenting videos and generating video-audio-speech captions. This improves the alignment between video and text. We scale both data and model size for our InternVideo2. Through extensive experiments, we validate our designs and demonstrate state-of-the-art performance on over 60 video and audio tasks. Notably, our model outperforms others on various video-related captioning, dialogue, and long video understanding benchmarks, highlighting its ability to reason over and comprehend long temporal contexts.
