Advanced Optimal Cross-Modal Fusion Mechanism for Audio-Video Based Artificial Emotion Recognition
Abstract
Advances in artificial emotional intelligence have greatly contributed to the multimodal emotion recognition task. Emotion recognition plays a crucial role in many domains, such as communication, e-learning, mental healthcare, contextual awareness, and customer satisfaction. As real-time data continues to expand, addressing the problem of emotion recognition has become increasingly critical and complex. A key challenge lies in recognizing emotions from multimodal heterogeneous input sources, aligning the extracted features, and developing robust emotion recognition models. In this study, we explore a cross-modal fusion mechanism over audio and video modalities for emotion recognition, effectively addressing the associated feature complexities. We use 2D-CNN and 3D-CNN deep learning models for audio and video feature extraction and develop robust emotion recognition models. This study emphasizes the importance of the Compact Bilinear Gated Pooling (CBGP) cross-modal fusion mechanism and highlights the contribution of fusing features from the audio and video modalities for emotion recognition. It also discusses the working principle of CBGP and compares its performance with peer cross-modal fusion techniques such as FBP and CBP. The performance of the advanced cross-modal fusion is compared with baseline traditional cross-modal fusion mechanisms, including EF-LSTM, LF-LSTM, Graph-MFN, and hybrid fusion, as well as transformer-based fusion mechanisms such as attention fusion and transformer fusion. The experiments are performed on the benchmark CMU-MOSEI dataset and achieve an accuracy of 80.3%, an F1-score of 79.2%, and an MAE of 54.2%.
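Since the abstract does not give the implementation, the following PyTorch sketch only illustrates the general idea behind a compact-bilinear-pooling-with-gating fusion of two unimodal feature vectors: Count Sketch projections combined through an FFT-domain product, followed by a learned sigmoid gate. The class name, the gate design, and all dimensions are illustrative assumptions, not the authors' CBGP implementation.

import torch
import torch.nn as nn

class CompactBilinearGatedPooling(nn.Module):
    """Illustrative sketch (assumed design): compact bilinear pooling of
    audio and video feature vectors via Count Sketch + FFT, with a gate."""

    def __init__(self, dim_audio, dim_video, dim_out):
        super().__init__()
        # Fixed random hash indices and sign vectors for the Count Sketch.
        self.register_buffer("h_a", torch.randint(dim_out, (dim_audio,)))
        self.register_buffer("s_a", 2 * torch.randint(2, (dim_audio,)).float() - 1)
        self.register_buffer("h_v", torch.randint(dim_out, (dim_video,)))
        self.register_buffer("s_v", 2 * torch.randint(2, (dim_video,)).float() - 1)
        self.dim_out = dim_out
        # Gate from concatenated unimodal features (assumption for illustration).
        self.gate = nn.Sequential(nn.Linear(dim_audio + dim_video, dim_out), nn.Sigmoid())

    def _count_sketch(self, x, h, s):
        # Project x of shape (batch, dim_in) to (batch, dim_out) by signed hashing.
        sketch = x.new_zeros(x.size(0), self.dim_out)
        sketch.index_add_(1, h, x * s)
        return sketch

    def forward(self, audio_feat, video_feat):
        # Element-wise product in the FFT domain approximates the outer
        # product (bilinear interaction) of the two unimodal feature vectors.
        fft_a = torch.fft.rfft(self._count_sketch(audio_feat, self.h_a, self.s_a))
        fft_v = torch.fft.rfft(self._count_sketch(video_feat, self.h_v, self.s_v))
        fused = torch.fft.irfft(fft_a * fft_v, n=self.dim_out)
        # The gate rescales each fused dimension before classification.
        return fused * self.gate(torch.cat([audio_feat, video_feat], dim=-1))

For example, 2D-CNN audio features of size 256 and 3D-CNN video features of size 512 could be fused into a 1024-dimensional vector with CompactBilinearGatedPooling(256, 512, 1024) and then passed to an emotion classifier head.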
Full Text: PDF
DOI: https://doi.org/10.31449/inf.v49i12.7392

This work is licensed under a Creative Commons Attribution 3.0 License.