Deep Dive Into Music Videos: Hierarchical Emotion Recognition With Rich Audio and Visual Features
Y. R. Pandeya, Ashim Gelal, Harish Chandra Bhandari, Priya Pandey
Abstract
This study aimed to address the challenges of cultural diversity and limited labeled data in music emotion classification. We introduced a benchmark dataset of music videos featuring hierarchical emotion labels that range from coarse to fine levels. For classification, we considered six established audio and video feature types: geometric, spectral, harmonic, temporal, spatiotemporal, and visual attributes. We proposed hierarchical music video emotion classification networks and established baseline results on our dataset. Additionally, we presented a pipeline for audio processing using graph neural networks with reduced edge connections. Our convolutional neural network models for 1D, 2D, and 3D audio and video processing outperformed existing methods in various scenarios while requiring minimal training parameters. The study used both quantitative measures and visual analysis to evaluate the results.
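The hierarchical coarse-to-fine labeling described in the abstract can be sketched as a fine-to-coarse mapping with a consistency check between the two prediction levels. The label names and two-level structure below are illustrative assumptions, not the dataset's actual taxonomy:

```python
# Minimal sketch of hierarchical (coarse-to-fine) emotion labels.
# The hierarchy below is a hypothetical example, not the paper's taxonomy.
FINE_TO_COARSE = {
    "excited": "positive",
    "tender": "positive",
    "sad": "negative",
    "fearful": "negative",
}

def coarse_label(fine: str) -> str:
    """Roll a fine-grained prediction up to its coarse parent label."""
    return FINE_TO_COARSE[fine]

def hierarchy_consistent(coarse_pred: str, fine_pred: str) -> bool:
    """A fine prediction is consistent when its parent matches the coarse prediction."""
    return coarse_label(fine_pred) == coarse_pred
```

In a hierarchical classifier, a check like this can be used at evaluation time to measure how often the fine-level head contradicts the coarse-level head.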
