The Mel and Enid Zuckerman College of Public Health, The University of Arizona, 1295 N. Martin Ave, Tucson, AZ 85724, USA.
Sci Total Environ. 2020 Aug 15;730:139140. doi: 10.1016/j.scitotenv.2020.139140. Epub 2020 May 3.
Monitoring of environmental contaminants is a critical part of exposure science research and public health practice. Missing data are often encountered when performing short-term monitoring (<24 h) of air pollutants with real-time monitors, especially in resource-limited areas. Approaches for handling consecutive periods of missing and incomplete data in this context remain unclear. Our aim is to evaluate existing imputation methods for handling missing data from real-time monitors operating for short durations. In an ongoing field study, real-time PM2.5 monitors were placed outside 20 households and run for 24 hours. Missing data were simulated for these households as consecutive periods at four levels of missingness (20%, 40%, 60%, 80%). Univariate (Mean, Median, Last Observation Carried Forward, Kalman Filter, Random, Markov) and multivariate time-series (Predictive Mean Matching, Row Mean Method) methods were used to impute missing concentrations, and performance was evaluated using five error metrics (Absolute Bias, Percent Absolute Error in Means, Coefficient of Determination (R²), Root Mean Square Error, Mean Absolute Error). The univariate Markov, random, and mean imputation methods performed best, yielding 24-hour mean concentrations with the lowest error and highest R² values across all levels of missingness. When error metrics were evaluated minute by minute, Kalman filter, median, and Markov methods performed well at low levels of missingness (20-40%). However, at higher levels of missingness (60-80%), Markov, random, median, and mean imputation performed best on average. Multivariate methods were the worst-performing imputation methods across all levels of missingness. Imputation using univariate methods may provide a reasonable solution for addressing missing data in short-term monitoring of air pollutants, especially in resource-limited areas. Further efforts are needed to evaluate imputation methods that are generalizable across a diverse range of study environments.
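The simulate-then-impute evaluation described above can be illustrated with a minimal Python sketch. It is not the study's code: the minute-level series is synthetic, the function names (`simulate_gap`, `impute`, `error_metrics`) are hypothetical, and the metric formulas are assumed common definitions rather than the ones the authors used. Only four of the simpler univariate methods (mean, median, random, LOCF) are shown; the Markov, Kalman filter, and multivariate methods are omitted.

```python
import numpy as np

def simulate_gap(series, fraction, rng):
    """Copy `series` and blank out one consecutive block covering `fraction` of it."""
    n = len(series)
    gap_len = int(round(fraction * n))
    # Start at index >= 1 so LOCF always has a prior observation (an assumption).
    start = rng.integers(1, n - gap_len + 1)
    out = np.asarray(series, dtype=float).copy()
    out[start:start + gap_len] = np.nan
    return out

def impute(series, method, rng):
    """Fill NaNs with one of four simple univariate methods."""
    out = series.copy()
    missing = np.isnan(out)
    observed = out[~missing]
    if method == "mean":
        out[missing] = observed.mean()
    elif method == "median":
        out[missing] = np.median(observed)
    elif method == "random":
        # Draw replacements at random from the observed values.
        out[missing] = rng.choice(observed, size=missing.sum(), replace=True)
    elif method == "locf":
        # Index of the most recent observed value at each position.
        idx = np.where(~missing, np.arange(len(out)), 0)
        out = out[np.maximum.accumulate(idx)]
    else:
        raise ValueError(f"unknown method: {method}")
    return out

def error_metrics(truth, imputed):
    """Assumed definitions of three of the five metrics named in the abstract."""
    diff = imputed - truth
    return {
        "rmse": float(np.sqrt(np.mean(diff ** 2))),
        "mae": float(np.mean(np.abs(diff))),
        "pct_abs_error_in_means": float(
            100 * abs(imputed.mean() - truth.mean()) / truth.mean()
        ),
    }

def run_demo(seed=0):
    rng = np.random.default_rng(seed)
    # Synthetic stand-in for 24 h of minute-level PM2.5 readings (not real data).
    truth = rng.lognormal(mean=2.5, sigma=0.4, size=1440)
    for fraction in (0.2, 0.4, 0.6, 0.8):
        gapped = simulate_gap(truth, fraction, rng)
        for method in ("mean", "median", "random", "locf"):
            m = error_metrics(truth, impute(gapped, method, rng))
            print(f"{int(fraction * 100)}% {method:<6} "
                  f"RMSE={m['rmse']:.2f} MAE={m['mae']:.2f} "
                  f"%err(mean)={m['pct_abs_error_in_means']:.2f}")

run_demo()
```

Because the synthetic series has no temporal structure, the relative ranking of methods printed by this sketch says nothing about how they would rank on real monitoring data.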