Department of Civil and Environmental Engineering and Andlinger Center for Energy and the Environment, Princeton University, Princeton, New Jersey 08544, United States.
Environ Sci Technol. 2023 Nov 21;57(46):17671-17689. doi: 10.1021/acs.est.3c00026. Epub 2023 Jun 29.
Machine learning (ML) is increasingly used in environmental research to process large data sets and decipher complex relationships between system variables. However, when researchers are unfamiliar with ML or apply it without methodological rigor, the resulting studies may reach spurious conclusions. In this study, we combined a literature analysis with our own experience to compile, in tutorial form, common pitfalls and best-practice guidelines for environmental ML research. We identified more than 30 key items and, based on an analysis of 148 highly cited research articles, presented evidence of common misconceptions concerning terminology, sample size and feature size, data enrichment and feature selection, randomness assessment, data leakage management, data splitting, method selection and comparison, model optimization and evaluation, and model explainability and causality. By analyzing good examples of supervised learning and reference modeling paradigms, we hope to help researchers adopt more rigorous data preprocessing and model development standards, leading to more accurate, robust, and practicable use of models in environmental research and applications.
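Two of the items named in the abstract, data splitting and data leakage management, can be illustrated with a short sketch. The example below is not taken from the article; it is a minimal illustration, assuming a single numeric feature with made-up values and using only the Python standard library. The helper names (`split_indices`, `fit_standardizer`, `apply_standardizer`) are hypothetical. The point it tries to show is that preprocessing statistics should be estimated from the training split alone and then reused unchanged on the test split, so that no information from the test data leaks into model development.

```python
import random
import statistics


def split_indices(n_samples, test_fraction=0.2, seed=0):
    """Shuffle indices reproducibly, then split them into train and test sets."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # fixed seed makes the split repeatable
    n_test = max(1, round(n_samples * test_fraction))
    return indices[n_test:], indices[:n_test]


def fit_standardizer(values):
    """Estimate mean and standard deviation from the given (training) values."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        std = 1.0  # avoid division by zero for a constant feature
    return mean, std


def apply_standardizer(values, mean, std):
    """Scale values with statistics that were estimated elsewhere."""
    return [(v - mean) / std for v in values]


# Hypothetical feature values, for illustration only.
feature = [0.8, 1.1, 1.5, 2.0, 2.4, 3.1, 3.3, 4.0, 4.6, 9.5]

# 1. Split first, before any preprocessing touches the data.
train_idx, test_idx = split_indices(len(feature), test_fraction=0.2, seed=42)
train = [feature[i] for i in train_idx]
test = [feature[i] for i in test_idx]

# 2. Fit the scaler on the training split only.
mean, std = fit_standardizer(train)

# 3. Apply the same training statistics to both splits.
train_scaled = apply_standardizer(train, mean, std)
test_scaled = apply_standardizer(test, mean, std)
```

Computing `mean` and `std` from the full `feature` list before splitting would be the leaky variant: the test samples would then influence the scaling applied to the training data. In practice a library pipeline would normally handle this; the hand-written version here is only meant to make the order of operations explicit.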