Department of Mathematics and Statistics, University of Strathclyde, Glasgow G1 1XH, UK.
Department of Earth and Environmental Sciences, Faculty of Science, Kuwait University, P.O. Box 5969, Safat 13060, Kuwait.
Int J Environ Res Public Health. 2021 Feb 2;18(3):1333. doi: 10.3390/ijerph18031333.
In environmental research, missing data are a frequent challenge for statistical modeling. This paper applies a multiple imputation (MI) approach to missing values in an air quality data set. Three missing-data mechanisms are applied to the data set: missing completely at random (MCAR), missing at random (MAR), and not missing at random (NMAR). Five missing-data levels are considered: 5%, 10%, 20%, 30%, and 40%. The imputation method used is missForest, an iterative imputation method based on the random forest approach. Air quality data were gathered from five monitoring stations in Kuwait and aggregated to daily values. All pollutant data were log-transformed to bring their distributions closer to normal and to reduce skewness. We found high levels of missing values for NO2 (18.4%), CO (18.5%), PM10 (57.4%), SO2 (19.0%), and O3 (18.2%). Climatological data (air temperature, relative humidity, wind direction, and wind speed) were used as control variables to improve estimation. The results show that the MAR mechanism gave the lowest root mean square error (RMSE) and mean absolute error (MAE). We conclude that MI using missForest estimates missing values with a high level of accuracy. MissForest had the lowest imputation error (RMSE and MAE) of the imputation methods compared and can therefore be considered appropriate for analyzing air quality data.
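The evaluation design described in the abstract (delete known values under MCAR, MAR, or NMAR at a given rate, impute them, then score the imputed values with RMSE and MAE) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' code: the data are synthetic, the helper names (`amputate`, `mean_impute`, `imputation_errors`) are hypothetical, the way missingness is driven by a noisy score is one simple choice among many, and plain mean imputation stands in for missForest to keep the example dependency-free.

```python
import math
import random


def standardize(xs):
    """Scale a list to zero mean and unit standard deviation."""
    mean = sum(xs) / len(xs)
    sd = math.sqrt(sum((x - mean) ** 2 for x in xs) / len(xs))
    return [(x - mean) / sd for x in xs]


def amputate(values, rate, driver, rng):
    """Replace a fraction `rate` of `values` with None.

    Each entry gets a score driver[i] + Gaussian noise, and the highest-scoring
    entries are deleted. A constant driver gives MCAR, an observed covariate
    gives MAR, and the values themselves give NMAR. (Hypothetical scheme.)
    """
    n = len(values)
    k = round(rate * n)
    scores = [d + rng.gauss(0, 1) for d in driver]
    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)
    missing = set(ranked[:k])
    return [None if i in missing else v for i, v in enumerate(values)]


def mean_impute(observed):
    """Stand-in imputer (NOT missForest): fill gaps with the observed mean."""
    present = [v for v in observed if v is not None]
    fill = sum(present) / len(present)
    return [fill if v is None else v for v in observed]


def imputation_errors(truth, imputed, observed):
    """Return (RMSE, MAE) computed only over the entries that were deleted."""
    errs = [t - p for t, p, o in zip(truth, imputed, observed) if o is None]
    rmse = math.sqrt(sum(e * e for e in errs) / len(errs))
    mae = sum(abs(e) for e in errs) / len(errs)
    return rmse, mae


# Synthetic daily series: a log-scale pollutant that depends on temperature.
rng = random.Random(0)
temp = [rng.gauss(30, 8) for _ in range(365)]
log_no2 = [3 + 0.02 * t + rng.gauss(0, 0.3) for t in temp]

drivers = {
    "MCAR": [0.0] * len(log_no2),   # missingness ignores all data
    "MAR": standardize(temp),       # missingness depends on an observed covariate
    "NMAR": standardize(log_no2),   # missingness depends on the missing value itself
}

results = {}
for rate in (0.05, 0.10, 0.20, 0.30, 0.40):
    for name, driver in drivers.items():
        observed = amputate(log_no2, rate, driver, rng)
        imputed = mean_impute(observed)
        results[(name, rate)] = imputation_errors(log_no2, imputed, observed)

for (name, rate), (rmse, mae) in sorted(results.items()):
    print(f"{name:4s} {rate:4.0%}  RMSE={rmse:.3f}  MAE={mae:.3f}")
```

Swapping `mean_impute` for a random-forest-based iterative imputer that also uses the climatological covariates would bring the sketch closer to the method the abstract names; the scoring step would stay the same.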