A Multi-modal Diffusion Model-Based Digital Twin Framework for Stadium Management via IoT Data Fusion
Abstract
This study proposes a method for constructing a digital twin system for sports venues that integrates a multi-modal diffusion model with Internet of Things (IoT) data, aiming to achieve high-precision modeling and intelligent prediction of venue status. The framework consists of four layers (perception, data processing, modeling, and application) that form a closed loop of perception–fusion–modeling–feedback. The experimental setup used a multimodal dataset comprising over 50,000 high-resolution monitoring images, 8,000+ daily sensor records (temperature, humidity, CO₂, light, and noise), 15,000 text logs, and crowd/environmental audio spectrograms, collected by a sensor network sampling at 1–5 s intervals. By integrating these multimodal streams, the diffusion model achieved robust semantic fusion and predictive reconstruction. For benchmarking, the method was compared against CNN, GNN, and SVM baselines, as well as Transformer-based multimodal fusion and Graph Attention Networks (GATs). Relative to CNN-based models, the multimodal diffusion model reduced image, speech, and text processing times from 122 ms, 96 ms, and 78 ms to 78 ms, 65 ms, and 49 ms, respectively, an overall latency reduction of 35.1%. The overall sensor data integrity rate exceeded 98% (99.53% for the pedestrian flow sensor). For digital twin modeling, spatial restoration accuracy reached 96.3%, motion trajectory simulation 94.7%, and environmental prediction 93.5%, for an average accuracy of 94.8%, consistently outperforming the baseline approaches. The multi-modal diffusion model developed in this study and the IoT-integrated digital twin system perform well in perception fusion, scene prediction, and interaction, providing a strong theoretical basis and engineering support for the intelligent operation of sports venues.
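The aggregate figures quoted in the abstract can be reproduced from the per-modality numbers with a short calculation. The sketch below is illustrative only: it assumes the overall latency reduction is computed from the summed per-modality processing times and that the average accuracy is an unweighted mean. The abstract does not state either aggregation rule, and the variable and function names are hypothetical.

```python
# Per-modality processing times in milliseconds, as reported in the abstract.
cnn_ms = {"image": 122, "speech": 96, "text": 78}
diffusion_ms = {"image": 78, "speech": 65, "text": 49}


def relative_reduction(before: float, after: float) -> float:
    """Return the percentage reduction from `before` to `after`."""
    return (before - after) / before * 100.0


# Reduction for each modality taken separately.
per_modality = {
    name: relative_reduction(cnn_ms[name], diffusion_ms[name]) for name in cnn_ms
}

# Assumption: "overall latency" is the sum of the three processing times.
overall = relative_reduction(sum(cnn_ms.values()), sum(diffusion_ms.values()))

# Digital twin modeling accuracies (percent), as reported in the abstract.
accuracy = {"spatial": 96.3, "trajectory": 94.7, "environment": 93.5}

# Assumption: the reported average is an unweighted mean of the three values.
mean_accuracy = sum(accuracy.values()) / len(accuracy)

for name, value in per_modality.items():
    print(f"{name}: {value:.1f}% faster")
print(f"overall latency reduction: {overall:.1f}%")
print(f"mean modeling accuracy: {mean_accuracy:.1f}%")
```

Under these assumptions the summed times fall from 296 ms to 192 ms, which matches the reported 35.1% reduction, and the unweighted mean of the three accuracies matches the reported 94.8%.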
DOI: https://doi.org/10.31449/inf.v49i28.10300
This work is licensed under a Creative Commons Attribution 3.0 License.








