CMMF and STAM-FNet: Multimodal Fusion Architectures for Complex Scene Understanding in Dynamic Environments
Abstract
Multimodal perception has emerged as a vital strategy for understanding complex, dynamic environments in which traditional unimodal approaches struggle with data heterogeneity and occlusion. This paper proposes two multimodal fusion frameworks, CMMF (Cross-Modal Matching Fusion) and STAM-FNet (Spatio-Temporal Attention Multimodal Fusion Network), to address the structural and temporal challenges of complex scene understanding. CMMF adopts a three-stage architecture with cross-modal semantic alignment and dynamic weighting, while STAM-FNet introduces spatio-temporal attention layers and 3D convolutions to enhance feature discrimination in dynamic environments. Experiments are conducted on a dataset of 120,000 samples covering three application scenarios: urban monitoring, indoor interaction, and transportation hubs. Evaluation uses standardized metrics including Top-1 accuracy, F1-score, AUC, Modal Gain Index, and inference delay. Compared with state-of-the-art baselines such as ResNet50, Two-Stream Transformer, and MMBT, STAM-FNet achieves up to a 15.8% improvement in accuracy and a 20% robustness gain under high-occlusion conditions, while CMMF maintains superior performance on static tasks with a low parameter count (24.3M). This work demonstrates the effectiveness of adaptive multimodal fusion in improving the accuracy, efficiency, and fault tolerance of real-world perception systems.
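The dynamic-weighting idea behind CMMF can be illustrated with a minimal sketch. All names, shapes, and the gating scheme below are illustrative assumptions for exposition, not the authors' implementation: aligned per-modality embeddings are scored by a gating vector, and the softmax of those scores weights each modality's contribution to the fused representation.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def dynamic_weighted_fusion(features, gate_w):
    """Fuse per-modality feature vectors with adaptive softmax weights.

    features: list of M arrays, each of shape (d,), assumed to be
              semantically aligned modality embeddings.
    gate_w:   array of shape (d,), a (here random, normally learned)
              gating vector scoring each modality's reliability.
    Returns the fused (d,) vector and the (M,) modality weights.
    """
    F = np.stack(features)                # (M, d)
    scores = F @ gate_w                   # (M,) reliability score per modality
    w = softmax(scores)                   # (M,) weights, sum to 1
    fused = (w[:, None] * F).sum(axis=0)  # (d,) weighted combination
    return fused, w

rng = np.random.default_rng(0)
feats = [rng.standard_normal(8) for _ in range(3)]  # e.g. RGB, depth, audio
fused, w = dynamic_weighted_fusion(feats, rng.standard_normal(8))
print(w)  # softmax weights over the three modalities
```

The key property is that the weights are input-dependent: a modality whose features score as unreliable (e.g. heavily occluded video) is down-weighted per sample rather than by a fixed fusion ratio.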
DOI: https://doi.org/10.31449/inf.v49i9.9830
This work is licensed under a Creative Commons Attribution 3.0 License.