Emotion Recognition from Text and Voice: A Multimodal AI Approach to Understanding Human Feelings

Mansara Sharma

2025 · DOI: 10.22214/ijraset.2025.73790
International Journal for Research in Applied Science and Engineering Technology

TLDR

This project addresses how machines can interpret emotions by presenting a multimodal AI system that analyzes speech and text simultaneously, using NLP and speech-analysis algorithms combined through a fusion-based deep learning architecture.

Abstract

In today's global online world, technology acts as an interface for much of our communication, from intelligent assistants to online customer support and social networking. Humans are proficient at interpreting the feelings that accompany communication, and machines are now expected to do the same; yet decoding the emotions carried by words in real time remains an unsolved problem. This project addresses how machines can interpret emotions. Our work centers on a multimodal AI system that recognizes emotions by examining speech and text simultaneously, using NLP and speech-analysis algorithms combined through a fusion-based deep learning architecture. With modern NLP and speech processing, we build systems that decode not only what is said but how it is said. The backbone of the text stream is the extraction of emotional cues through language processing: our model uses the transformer models DistilBERT and RoBERTa. Emotion is also carried in the voice; MFCCs and chroma spectrogram frames are extracted from speech and processed by a CNN-LSTM hybrid for voice emotion recognition. A further fusion model combines the two streams, and evaluation of both the fused model and the individual streams shows that this approach enhances emotion detection. In this project the voice and the text are processed independently, each model produces its own output, and the outputs are then combined.
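To make the text branch concrete, the following is a minimal sketch of transformer-based emotion classification with the Hugging Face transformers library. The abstract names DistilBERT and RoBERTa but does not specify checkpoints or a label set, so the base checkpoint, the six-class label list, and the helper function below are illustrative assumptions rather than the paper's reported configuration.

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Assumption: the label set is illustrative; the paper does not list its classes.
LABELS = ["anger", "fear", "joy", "sadness", "surprise", "neutral"]
CHECKPOINT = "distilbert-base-uncased"  # "roberta-base" would swap in the RoBERTa variant

tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
model = AutoModelForSequenceClassification.from_pretrained(
    CHECKPOINT,
    num_labels=len(LABELS),  # classification head would be fine-tuned on an emotion corpus
)
model.eval()

def text_emotion_logits(text: str) -> torch.Tensor:
    """Return per-class logits for one utterance's transcript."""
    batch = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        return model(**batch).logits  # shape (1, len(LABELS))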
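The voice branch and the fusion step can be sketched in the same spirit. The feature choices (40 MFCCs plus 12 chroma bins), the fixed frame count, the specific CNN-LSTM layout, and the probability-averaging fusion below are plausible assumptions, not the paper's reported configuration.

import numpy as np
import librosa
import torch
import torch.nn as nn

NUM_EMOTIONS = 6  # assumption: matches the illustrative label set above

def extract_features(path, n_mfcc=40, n_frames=200):
    """Stack MFCC and chroma frames into a (time, feature) matrix."""
    y, sr = librosa.load(path, sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)  # (n_mfcc, T)
    chroma = librosa.feature.chroma_stft(y=y, sr=sr)        # (12, T)
    feats = np.concatenate([mfcc, chroma], axis=0).T        # (T, n_mfcc + 12)
    # Pad or truncate to a fixed number of frames for batching.
    if feats.shape[0] < n_frames:
        feats = np.pad(feats, ((0, n_frames - feats.shape[0]), (0, 0)))
    return feats[:n_frames]

class CnnLstm(nn.Module):
    """1-D CNN over feature frames followed by an LSTM: one plausible
    reading of the paper's 'CNN-LSTM hybrid' for voice emotion."""
    def __init__(self, n_feats=52, hidden=128, n_classes=NUM_EMOTIONS):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_feats, 64, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.MaxPool1d(2),
        )
        self.lstm = nn.LSTM(64, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_classes)

    def forward(self, x):                      # x: (batch, time, n_feats)
        z = self.conv(x.transpose(1, 2))       # (batch, 64, time / 2)
        out, _ = self.lstm(z.transpose(1, 2))  # (batch, time / 2, hidden)
        return self.head(out[:, -1])           # logits from the last time step

def fuse(text_logits, voice_logits, w=0.5):
    """Weighted average of per-stream probabilities (simple late fusion)."""
    p_text = torch.softmax(text_logits, dim=-1)
    p_voice = torch.softmax(voice_logits, dim=-1)
    return w * p_text + (1 - w) * p_voice

Averaging per-stream probabilities is the simplest late-fusion baseline; the "more sophisticated" fusion models the abstract mentions would presumably learn this combination rather than fix it by hand.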
