Enhanced Dynamic Interactive Multi-View Memory Network for Utterance-Level Sentiment Recognition
Zhiyan Chen, Libo Dong
Abstract
Multimodal models are key to improving emotion recognition in conversation: the video, audio, and text of a dialogue can all help analyze the interlocutors' emotional expression. Existing emotion recognition models often struggle to capture complex contextual information and nuanced emotional tones, making it difficult to exploit dialogue information effectively, and integrating multimodal features to better understand emotional expression in dialogue remains a challenge. To address these issues, an Enhanced Dynamic Interactive Multi-view Memory Network for Utterance-level Sentiment Recognition (EDIMMN-USR) is proposed. First, cross-modal interactive information is dynamically captured through a multi-view attention network. Second, hierarchical gated recurrent units, combined with feature fusion and Temporal Convolutional Networks, learn long-range dependencies and complex contextual emotions for each interlocutor. Finally, bidirectional gated recurrent units and a memory network model the global dialogue, capturing complex contextual information across multiple turns. Experimental results on the IEMOCAP and MELD datasets show that EDIMMN-USR improves accuracy by 1.87% and 4.7%, respectively.
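
To make the three-stage pipeline described above concrete, the following is a minimal PyTorch sketch of how such a model could be organized. The module names, dimensionalities, head counts, fusion strategy, and class count are illustrative assumptions, not the authors' released implementation.

```python
# Illustrative sketch of the EDIMMN-USR pipeline; all design details are assumptions.
import torch
import torch.nn as nn


class MultiViewAttention(nn.Module):
    """Stage 1 (assumed form): each modality attends to the other views for cross-modal interaction."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, query_mod, other_mods):
        # query_mod: (B, T, D) features of one modality; other_mods: remaining modalities' features
        fused, _ = self.attn(query_mod, other_mods, other_mods)
        return fused


class SpeakerEncoder(nn.Module):
    """Stage 2 (assumed form): per-speaker GRU for long-range dependencies plus a dilated temporal convolution."""
    def __init__(self, dim):
        super().__init__()
        self.gru = nn.GRU(dim, dim, batch_first=True)
        self.tcn = nn.Conv1d(dim, dim, kernel_size=3, padding=2, dilation=2)

    def forward(self, x):
        h, _ = self.gru(x)                                   # (B, T, D)
        c = self.tcn(h.transpose(1, 2)).transpose(1, 2)      # same length due to padding
        return h + c                                         # simple residual fusion of the two views


class GlobalDialogueEncoder(nn.Module):
    """Stage 3 (assumed form): bidirectional GRU plus an attention-based memory read over all turns."""
    def __init__(self, dim, num_classes=6):
        super().__init__()
        self.bigru = nn.GRU(dim, dim, batch_first=True, bidirectional=True)
        self.memory_attn = nn.MultiheadAttention(2 * dim, 4, batch_first=True)
        self.classifier = nn.Linear(2 * dim, num_classes)

    def forward(self, x):
        m, _ = self.bigru(x)                                 # dialogue-level memory bank (B, T, 2D)
        read, _ = self.memory_attn(m, m, m)                  # single-hop memory read
        return self.classifier(read)                         # per-utterance emotion logits


# Example usage with hypothetical feature dimensions.
B, T, D = 2, 10, 128
text = torch.randn(B, T, D)                                  # text-view utterance features
audio_visual = torch.randn(B, 2 * T, D)                      # concatenated audio/visual features
x = MultiViewAttention(D)(text, audio_visual)
x = SpeakerEncoder(D)(x)
logits = GlobalDialogueEncoder(D)(x)                         # (B, T, num_classes)
```

This sketch only fixes the order of the three stages named in the abstract; the actual interaction, fusion, and memory mechanisms of EDIMMN-USR are defined in the body of the paper.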
