Handling Complex Missing Data Using Random Forest Approach for an Air Quality Monitoring Dataset: A Case Study of Kuwait Environmental Data (2012 to 2018).

Affiliations

Department of Mathematics and Statistics, University of Strathclyde, Glasgow G1 1XH, UK.

Department of Earth and Environmental Sciences, Faculty of Science, Kuwait University, P.O. Box 5969, Safat 13060, Kuwait.

Publication information

Int J Environ Res Public Health. 2021 Feb 2;18(3):1333. doi: 10.3390/ijerph18031333.

DOI: 10.3390/ijerph18031333
PMID: 33540610
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC7908071/
Abstract

In environmental research, missing data are often a challenge for statistical modeling. This paper addressed some advanced techniques to deal with missing values in a data set measuring air quality using a multiple imputation (MI) approach. MCAR, MAR, and NMAR missing data techniques are applied to the data set. Five missing data levels are considered: 5%, 10%, 20%, 30%, and 40%. The imputation method used in this paper is an iterative imputation method, missForest, which is related to the random forest approach. Air quality data sets were gathered from five monitoring stations in Kuwait, aggregated to a daily basis. Logarithm transformation was carried out for all pollutant data, in order to normalize their distributions and to minimize skewness. We found high levels of missing values for NO2 (18.4%), CO (18.5%), PM10 (57.4%), SO2 (19.0%), and O3 (18.2%) data. Climatological data (i.e., air temperature, relative humidity, wind direction, and wind speed) were used as control variables for better estimation. The results show that the MAR technique had the lowest RMSE and MAE. We conclude that MI using the missForest approach has a high level of accuracy in estimating missing values. MissForest had the lowest imputation error (RMSE and MAE) among the other imputation methods and, thus, can be considered to be appropriate for analyzing air quality data.
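The evaluation protocol described in the abstract (delete known values at a fixed rate, impute them, then score the imputations with RMSE and MAE) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' code: the data are synthetic lognormal values standing in for a pollutant series, only the MCAR mechanism is simulated, and a simple mean imputer is used as a placeholder where the paper uses missForest. The helper names (`mcar_mask`, `mean_impute`, `rmse`, `mae`) are hypothetical.

```python
import math
import random
import statistics


def mcar_mask(n, rate, rng):
    """Choose round(n * rate) positions uniformly at random to delete (MCAR)."""
    k = round(n * rate)
    return set(rng.sample(range(n), k))


def mean_impute(values, missing):
    """Placeholder imputer (NOT missForest): fill with the mean of observed values."""
    observed = [v for i, v in enumerate(values) if i not in missing]
    fill = statistics.fmean(observed)
    return [fill if i in missing else v for i, v in enumerate(values)]


def rmse(truth, imputed, missing):
    """Root mean squared error over the artificially deleted positions only."""
    return math.sqrt(sum((truth[i] - imputed[i]) ** 2 for i in missing) / len(missing))


def mae(truth, imputed, missing):
    """Mean absolute error over the artificially deleted positions only."""
    return sum(abs(truth[i] - imputed[i]) for i in missing) / len(missing)


rng = random.Random(0)

# Assumption: synthetic right-skewed values stand in for daily pollutant readings.
raw = [rng.lognormvariate(3.0, 0.5) for _ in range(1000)]

# Log transformation to reduce skewness, as the abstract describes.
logged = [math.log(x) for x in raw]

# The five missing-data levels considered in the abstract.
results = {}
for rate in (0.05, 0.10, 0.20, 0.30, 0.40):
    missing = mcar_mask(len(logged), rate, rng)
    imputed = mean_impute(logged, missing)
    results[rate] = (rmse(logged, imputed, missing), mae(logged, imputed, missing))
    print(f"{rate:.0%} missing: RMSE={results[rate][0]:.3f} MAE={results[rate][1]:.3f}")
```

Swapping `mean_impute` for a random-forest-based imputer, and adding MAR and NMAR masking functions whose deletion probability depends on other variables or on the value itself, would bring the sketch closer to the comparison the abstract reports.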

Similar articles

1
Handling Complex Missing Data Using Random Forest Approach for an Air Quality Monitoring Dataset: A Case Study of Kuwait Environmental Data (2012 to 2018).
Int J Environ Res Public Health. 2021 Feb 2;18(3):1333. doi: 10.3390/ijerph18031333.
2
Handling missing data in a rheumatoid arthritis registry using random forest approach.
Int J Rheum Dis. 2021 Oct;24(10):1282-1293. doi: 10.1111/1756-185X.14203. Epub 2021 Aug 12.
3
missForest with feature selection using binary particle swarm optimization improves the imputation accuracy of continuous data.
Genes Genomics. 2022 Jun;44(6):651-658. doi: 10.1007/s13258-022-01247-8. Epub 2022 Apr 6.
4
Forecasts of tropospheric ozone in the Metropolitan Area of Rio de Janeiro based on missing data imputation and multivariate calibration techniques.
Environ Monit Assess. 2021 Jul 28;193(8):531. doi: 10.1007/s10661-021-09333-2.
5
Effects of short-term exposure to air pollution on hospital admissions of young children for acute lower respiratory infections in Ho Chi Minh City, Vietnam.
Res Rep Health Eff Inst. 2012 Jun(169):5-72; discussion 73-83.
6
The impact of the congestion charging scheme on air quality in London. Part 1. Emissions modeling and analysis of air pollution measurements.
Res Rep Health Eff Inst. 2011 Apr(155):5-71.
7
[Meta-analysis of the Italian studies on short-term effects of air pollution--MISA 1996-2002].
Epidemiol Prev. 2004 Jul-Oct;28(4-5 Suppl):4-100.
8
Accuracy of random-forest-based imputation of missing data in the presence of non-normality, non-linearity, and interaction.
BMC Med Res Methodol. 2020 Jul 25;20(1):199. doi: 10.1186/s12874-020-01080-1.
9
[Meta-analysis of the Italian studies on short-term effects of air pollution].
Epidemiol Prev. 2001 Mar-Apr;25(2 Suppl):1-71.
10
MissForest--non-parametric missing value imputation for mixed-type data.
Bioinformatics. 2012 Jan 1;28(1):112-8. doi: 10.1093/bioinformatics/btr597. Epub 2011 Oct 28.

Cited by

1
Integrating Artificial Intelligence in Environmental Monitoring: A Paradigm Shift in Data-Driven Sustainability.
Ecohealth. 2025 Aug 28. doi: 10.1007/s10393-025-01752-8.
2
Nationwide Machine Learning-Ensemble PM Mapping Prediction and Forecasting Models in South Korea with High Spatiotemporal Resolution and Health Risk Estimation-Based Evaluations.
Environ Health (Wash). 2025 Apr 23;3(8):878-887. doi: 10.1021/envhealth.4c00201. eCollection 2025 Aug 15.
3
Selective mortality during famine and plague events in medieval London.
Sci Rep. 2025 Jul 25;15(1):27133. doi: 10.1038/s41598-025-13198-7.
4
Unsupervised Clustering of Patients Undergoing Thoracoscopic Ablation Identifies Relevant Phenotypes for Advanced Atrial Fibrillation.
Diagnostics (Basel). 2025 May 16;15(10):1269. doi: 10.3390/diagnostics15101269.
5
Unravelling the association of glycosylated haemoglobin A1c, blood pressure, and LDL-cholesterol (ABC) with all-cause mortality in Type 2 diabetes patients: insights from a middle-income country.
J Diabetes Metab Disord. 2025 Apr 30;24(1):111. doi: 10.1007/s40200-025-01620-w. eCollection 2025 Jun.
6
Predictive modeling of climate change impacts using Artificial Intelligence: a review for equitable governance and sustainable outcome.
Environ Sci Pollut Res Int. 2025 Apr;32(17):10705-10724. doi: 10.1007/s11356-025-36356-w. Epub 2025 Apr 4.
7
Spatio-temporal characterization of PM10 concentration across Abu Dhabi Emirate (UAE).
Heliyon. 2024 Jun 13;10(12):e32812. doi: 10.1016/j.heliyon.2024.e32812. eCollection 2024 Jun 30.
8
Optimizing cardiovascular disease mortality prediction: a super learner approach in the Tehran lipid and glucose study.
BMC Med Inform Decis Mak. 2024 Apr 16;24(1):97. doi: 10.1186/s12911-024-02489-0.
9
Targeted therapy using polymyxin B hemadsorption in patients with sepsis: a post-hoc analysis of the JSEPTIC-DIC study and the EUPHRATES trial.
Crit Care. 2023 Jun 21;27(1):245. doi: 10.1186/s13054-023-04533-3.
10
Evaluation of roadside air quality using deep learning models after the application of the diesel vehicle policy (Euro 6).
Sci Rep. 2022 Dec 1;12(1):20769. doi: 10.1038/s41598-022-24886-z.

References

1
Influence of Ambient Air Pollution on Rheumatoid Arthritis Disease Activity Score Index.
Int J Environ Res Public Health. 2020 Jan 8;17(2):416. doi: 10.3390/ijerph17020416.
2
Random forest-based imputation outperforms other methods for imputing LC-MS metabolomics data: a comparative study.
BMC Bioinformatics. 2019 Oct 11;20(1):492. doi: 10.1186/s12859-019-3110-0.
3
Random Forest Missing Data Algorithms.
Stat Anal Data Min. 2017 Dec;10(6):363-377. doi: 10.1002/sam.11348. Epub 2017 Jun 13.
4
Multiple Imputation for Multivariate Missing-Data Problems: A Data Analyst's Perspective.
Multivariate Behav Res. 1998 Oct 1;33(4):545-71. doi: 10.1207/s15327906mbr3304_5.
5
Missing value imputation in high-dimensional phenomic data: imputable or not, and how?
BMC Bioinformatics. 2014 Nov 5;15(1):346. doi: 10.1186/s12859-014-0346-6.
6
Log-transformation and its implications for data analysis.
Shanghai Arch Psychiatry. 2014 Apr;26(2):105-9. doi: 10.3969/j.issn.1002-0829.2014.02.009.
7
Avoiding bias due to perfect prediction in multiple imputation of incomplete categorical variables.
Comput Stat Data Anal. 2010 Oct 1;54(10):2267-2275. doi: 10.1016/j.csda.2010.04.005.
8
Influence of pattern of missing data on performance of imputation methods: an example using national data on drug injection in prisons.
Int J Health Policy Manag. 2013 Jun 3;1(1):69-77. doi: 10.15171/ijhpm.2013.11. eCollection 2013 Jun.
9
Comparison of random forest and parametric imputation models for imputing missing data using MICE: a CALIBER study.
Am J Epidemiol. 2014 Mar 15;179(6):764-74. doi: 10.1093/aje/kwt312. Epub 2014 Jan 12.
10
Recovery of information from multiple imputation: a simulation study.
Emerg Themes Epidemiol. 2012 Jun 13;9(1):3. doi: 10.1186/1742-7622-9-3.