Multimodal Emotion Recognition in Conversations Using Transformer and Graph Neural Networks
Hua Jin, Tian Yang, +2 authors, Xuehua Song
TLDR
MTG-ERC combines global and local conversational emotion features through Transformer and graph fusion modules, strengthening intra-modal and inter-modal emotional interactions, and outperforms existing baseline models on IEMOCAP and MELD.
Abstract
To comprehensively capture emotional information within and between modalities, to address the challenge of modelling both global and local features in conversation, and to improve the accuracy of multimodal conversational emotion recognition, we present the Multimodal Transformer and GNN for Emotion Recognition in Conversations (MTG-ERC) model. The model incorporates a multi-level Transformer fusion module that employs multi-head self-attention and cross-modal attention mechanisms to capture interaction patterns within and between modalities. To address the weakness of attention-based models in capturing short-term dependencies, we introduce a directed multi-relational graph fusion module, which uses directed graphs with multiple relation types to fuse multimodal information efficiently and to model short-term, speaker-dependent emotional shifts. By integrating the outputs of these two modules, MTG-ERC combines global and local conversational emotion features while strengthening intra-modal and inter-modal emotional interactions. On the IEMOCAP and MELD datasets, the model shows consistent improvements of around 1% absolute in both accuracy and weighted F1 over baseline models, validating its effectiveness.
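
To make the Transformer fusion idea concrete, the following is a minimal sketch: multi-head self-attention models interactions within one modality, and cross-modal attention lets that modality's utterance features attend to another modality's. All names, dimensions, and the single-layer structure are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class CrossModalFusionLayer(nn.Module):
    """Illustrative fusion layer: intra-modal self-attention followed by
    inter-modal cross-attention (queries from one modality, keys/values
    from the other). Hypothetical sketch, not the paper's exact module."""
    def __init__(self, dim=128, heads=4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x, other):
        # x, other: (batch, num_utterances, dim) features of two modalities.
        h, _ = self.self_attn(x, x, x)           # interactions within a modality
        x = self.norm1(x + h)                    # residual connection + norm
        h, _ = self.cross_attn(x, other, other)  # x attends to the other modality
        return self.norm2(x + h)

# Toy usage: fuse text features with audio features.
text = torch.randn(2, 5, 128)    # (batch, utterances, dim)
audio = torch.randn(2, 5, 128)
fused = CrossModalFusionLayer()(text, audio)
print(fused.shape)               # torch.Size([2, 5, 128])
```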

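Similarly, here is a minimal sketch of a directed multi-relational graph over utterances. Edges connect each utterance to neighbours within a local window, with the relation type encoding speaker identity (same vs. different) and direction (past vs. future); an R-GCN-style layer then aggregates messages per relation type. This construction is a common choice in the ERC literature and is assumed here for illustration; it may differ from the paper's exact graph design.

```python
import torch
import torch.nn as nn

def build_edges(speakers, window=2):
    """Directed edges within a local window; relation type encodes
    same/different speaker and past/future direction (4 types)."""
    src, dst, rel = [], [], []
    n = len(speakers)
    for i in range(n):
        for j in range(max(0, i - window), min(n, i + window + 1)):
            if i == j:
                continue
            same = int(speakers[i] == speakers[j])
            past = int(j < i)            # source utterance precedes target
            src.append(j)
            dst.append(i)
            rel.append(2 * same + past)
    return torch.tensor(src), torch.tensor(dst), torch.tensor(rel)

class RelationalGraphLayer(nn.Module):
    """R-GCN-style aggregation: a separate linear map per relation type.
    Illustrative sketch, not the paper's exact layer."""
    def __init__(self, dim=128, num_rel=4):
        super().__init__()
        self.rel_lin = nn.ModuleList([nn.Linear(dim, dim) for _ in range(num_rel)])
        self.self_lin = nn.Linear(dim, dim)

    def forward(self, x, src, dst, rel):
        # x: (num_utterances, dim) fused utterance features.
        agg = torch.zeros_like(x)
        for r, lin in enumerate(self.rel_lin):
            mask = rel == r
            if mask.any():
                # Transform neighbour features per relation, sum into targets.
                agg.index_add_(0, dst[mask], lin(x[src[mask]]))
        deg = torch.zeros(x.size(0)).index_add_(0, dst, torch.ones(dst.numel()))
        agg = agg / deg.clamp(min=1).unsqueeze(-1)   # mean over incoming edges
        return torch.relu(self.self_lin(x) + agg)

# Toy usage: 5 utterances from two speakers.
x = torch.randn(5, 128)
src, dst, rel = build_edges(["A", "B", "A", "A", "B"])
print(RelationalGraphLayer()(x, src, dst, rel).shape)  # torch.Size([5, 128])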