Cross-Modal Transformer with Dynamic Attention Fusion for Emotion Recognition in Music via Audio-Lyrics Alignment
Abstract
Emotion recognition from multimodal signals remains challenging due to annotation subjectivity and heterogeneous feature spaces. To address these issues, this study proposes a cross-modal Transformer architecture with dynamic attention fusion for robust emotion classification. Raw acoustic signals are converted into time–frequency spectrograms, from which hierarchical features are extracted by a deep convolutional network. In parallel, textual data (e.g., lyrics or aligned semantic content) are encoded with a pre-trained language model to obtain context-aware embeddings. A cross-modal attention mechanism embedded in the Transformer encoder adaptively models inter-modal associations, enabling semantically guided acoustic representation learning. The fused joint representation is aggregated by pooling and passed to a fully connected classifier that outputs multi-category emotion probabilities. Experiments show that the proposed model outperforms CNN, CRNN, and conventional Transformer baselines under noisy conditions (average accuracy = 0.58, macro F1 = 0.55 at 0 dB SNR) and generalizes better across datasets (AUC = 0.832–0.887). With only 30% of the labels, the model still preserves emotion continuity reliably (CCC = 0.635; ICC = 0.584), highlighting its effectiveness in low-resource scenarios. These results confirm the potential of cross-modal Transformer fusion for advancing emotion-aware intelligent systems in multimodal perception applications.
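The core fusion step described above — acoustic frame features attending over lyric token embeddings so that text semantics guide the audio representation — can be illustrated with a minimal single-head cross-attention sketch. This is not the authors' implementation; the dimensions, the mean pooling, and all variable names are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(audio, text):
    """Audio frames act as queries; lyric tokens supply keys and values.

    audio: (T_a, d) acoustic features from the CNN front end
    text:  (T_t, d) token embeddings from the language model
    Returns a (T_a, d) semantically guided acoustic representation.
    """
    d_k = audio.shape[-1]
    scores = audio @ text.T / np.sqrt(d_k)   # (T_a, T_t) similarity scores
    weights = softmax(scores, axis=-1)       # each audio frame attends over lyric tokens
    return weights @ text                    # text-informed acoustic features

# Illustrative shapes: 10 audio frames, 5 lyric tokens, feature dim 8.
rng = np.random.default_rng(0)
audio = rng.normal(size=(10, 8))
text = rng.normal(size=(5, 8))
fused = cross_modal_attention(audio, text)
pooled = fused.mean(axis=0)  # temporal pooling before the classifier head
```

In the full model this block would sit inside a Transformer encoder layer with learned query/key/value projections and multiple heads; the sketch keeps only the attention arithmetic to show how lyric semantics reweight the acoustic timeline.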
DOI: https://doi.org/10.31449/inf.v49i28.11516
This work is licensed under a Creative Commons Attribution 3.0 License.