Interpretable Machine Learning Framework for Ozone Concentration Prediction Using TAP Data and SHAP Analysis in China
Abstract
This study aims to improve the monitoring and prediction of ozone pollution in support of targeted environmental governance and better air quality. It constructs and compares three machine learning models for ozone concentration prediction: Extreme Gradient Boosting (XGBoost), Light Gradient Boosting Machine (LightGBM), and Random Forest (RF). The study uses national online observation data from Tracking Air Pollution in China (TAP) spanning 2015–2025 and covering 34 provinces. Before modeling, the data undergo missing value imputation, outlier removal, and normalization. Shapley Additive Explanations (SHAP) values are then used to interpret the models and to identify the key driving factors of ozone formation and their regional differences. The results show that the national annual average ozone concentration increased from 105.6 μg/m³ in 2015 to 130.2 μg/m³ in 2019, a rise of 23.3%. The peak concentration in the Beijing-Tianjin-Hebei region reached 182.8 μg/m³ in 2019, exceeding the national limit by 114%. Ozone concentration decreased in 2020 under the impact of the COVID-19 epidemic, but in 2024 it was still 24.6 μg/m³ higher than in 2015. In terms of prediction performance, XGBoost performs best both nationwide and in all major regions. At the national level, its Mean Absolute Error (MAE) is 11.3 μg/m³, its Root Mean Squared Error (RMSE) is 15.6 μg/m³, its R² is 0.882, and its Nash-Sutcliffe Efficiency coefficient (NSE) is 0.88. SHAP analysis indicates that day of year, temperature, and sunshine duration are the main drivers of changes in national ozone concentration. NO₂ contributes significantly in the Beijing-Tianjin-Hebei region and the Fenwei Plain, while the effects of temperature and sunshine are strongest in the Pearl River Delta region. This study improves the understanding of the spatiotemporal dynamics and formation mechanisms of ozone pollution, and provides data support and a theoretical basis for formulating regional pollution prevention and control strategies.
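For readers unfamiliar with the four evaluation metrics named in the abstract, the following is a minimal sketch of how they are commonly computed. The function name `regression_metrics` and the sample values are hypothetical and are not taken from the study. The abstract does not state which definition of R² was used, so the sketch assumes the squared Pearson correlation; under the other common definition (1 minus the ratio of residual to total sum of squares), R² coincides with NSE.

```python
import math


def regression_metrics(observed, predicted):
    """Return MAE, RMSE, R2 and NSE for paired observed/predicted values.

    Assumption: R2 is the squared Pearson correlation coefficient;
    NSE is 1 - SS_res / SS_tot, computed against the observed mean.
    """
    n = len(observed)
    if n == 0 or n != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")

    errors = [p - o for o, p in zip(observed, predicted)]
    mae = sum(abs(e) for e in errors) / n
    ss_res = sum(e * e for e in errors)
    rmse = math.sqrt(ss_res / n)

    mean_obs = sum(observed) / n
    mean_pred = sum(predicted) / n
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    ss_pred = sum((p - mean_pred) ** 2 for p in predicted)
    if ss_tot == 0 or ss_pred == 0:
        raise ValueError("observed and predicted values must not be constant")

    # Nash-Sutcliffe Efficiency: 1 is a perfect fit, 0 matches the observed mean.
    nse = 1 - ss_res / ss_tot

    # Squared Pearson correlation between observations and predictions.
    cov = sum((o - mean_obs) * (p - mean_pred)
              for o, p in zip(observed, predicted))
    r2 = cov ** 2 / (ss_tot * ss_pred)

    return {"MAE": mae, "RMSE": rmse, "R2": r2, "NSE": nse}


# Hypothetical ozone values in μg/m³, for illustration only.
observed = [100.0, 120.0, 140.0, 160.0]
predicted = [110.0, 115.0, 150.0, 155.0]
metrics = regression_metrics(observed, predicted)
```

MAE and RMSE carry the units of the data (here μg/m³), while R² and NSE are dimensionless, which is why the abstract reports the first two with units and the last two without.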
DOI: https://doi.org/10.31449/inf.v49i24.10493
This work is licensed under a Creative Commons Attribution 3.0 License.








