Graph-Attention Fusion with VAE Cross-Modal Mapping and Reinforcement-Learning Visualization for Real-Time AR
Abstract
In AR scenarios, the intelligent generation and visualization of multimodal perception information face challenges such as feature heterogeneity, insufficient semantic alignment, and unstable real-time performance. To address these issues, this study proposes a feature modeling method that integrates an Attention-GCN for multimodal fusion, a variational autoencoder (VAE) with geometric and temporal constraints for cross-modal mapping, and a reinforcement-learning (PPO)-driven optimization mechanism, forming a "perception–generation–presentation–feedback" closed-loop system. Experiments are conducted on a self-built multimodal dataset of 28,000 sequences, with results evaluated on a held-out test set to ensure reliability. Baseline comparisons include a unimodal CNN and a heuristic fusion model under identical computational conditions. Results demonstrate that the proposed framework achieves an average delay of 1.42 ± 0.08 s, a frame rate of 57 ± 1.5 fps, a semantic alignment rate of 92.4 ± 1.1%, and an interaction interruption rate of 3.5 ± 0.4%, outperforming the baselines in efficiency, semantic consistency, and rendering stability. These findings highlight the framework's feasibility for real-time multimodal interaction in AR scenarios and its scalability across mid-range devices.
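To make the fusion stage concrete, the following is a minimal NumPy sketch of a single graph-attention layer over modality nodes, of the kind an Attention-GCN fusion module uses. The layer sizes, the LeakyReLU slope, and all variable names are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def graph_attention_fuse(node_feats, adj, W, a):
    """One graph-attention layer over modality nodes.

    node_feats: (N, F) array, one row per modality feature vector
    adj:        (N, N) binary adjacency (1 = modalities may attend)
    W:          (F, Fp) shared linear projection
    a:          (2*Fp,) attention vector
    Returns fused node features of shape (N, Fp).
    """
    h = node_feats @ W                  # project each node's features
    N = h.shape[0]
    logits = np.zeros((N, N))
    for i in range(N):
        for j in range(N):
            # e_ij = LeakyReLU(a^T [h_i || h_j]), slope 0.2 assumed
            s = a @ np.concatenate([h[i], h[j]])
            logits[i, j] = np.maximum(0.2 * s, s)
    logits = np.where(adj > 0, logits, -1e9)   # mask non-edges
    alpha = softmax(logits, axis=1)            # attention over neighbors
    return alpha @ h                           # weighted aggregation

# Tiny usage example: three modality nodes (e.g. visual, depth, inertial)
rng = np.random.default_rng(0)
x = rng.normal(size=(3, 4))
W = rng.normal(size=(4, 4))
a = rng.normal(size=(8,))
adj = np.ones((3, 3))       # fully connected modality graph
fused = graph_attention_fuse(x, adj, W, a)
```

Each fused row is a convex combination of the projected modality features, which is how attention weights let one modality borrow information from the others.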
DOI: https://doi.org/10.31449/inf.v49i14.11191
This work is licensed under a Creative Commons Attribution 3.0 License.