Machine Learning in Environmental Research: Common Pitfalls and Best Practices.

Affiliation

Department of Civil and Environmental Engineering and Andlinger Center for Energy and the Environment, Princeton University, Princeton, New Jersey 08544, United States.

Publication information

Environ Sci Technol. 2023 Nov 21;57(46):17671-17689. doi: 10.1021/acs.est.3c00026. Epub 2023 Jun 29.

DOI: 10.1021/acs.est.3c00026
PMID: 37384597
Abstract

Machine learning (ML) is increasingly used in environmental research to process large data sets and decipher complex relationships between system variables. However, due to the lack of familiarity and methodological rigor, inadequate ML studies may lead to spurious conclusions. In this study, we synthesized literature analysis with our own experience and provided a tutorial-like compilation of common pitfalls along with best practice guidelines for environmental ML research. We identified more than 30 key items and provided evidence-based data analysis based on 148 highly cited research articles to exhibit the misconceptions of terminologies, proper sample size and feature size, data enrichment and feature selection, randomness assessment, data leakage management, data splitting, method selection and comparison, model optimization and evaluation, and model explainability and causality. By analyzing good examples on supervised learning and reference modeling paradigms, we hope to help researchers adopt more rigorous data preprocessing and model development standards for more accurate, robust, and practicable model uses in environmental research and applications.

Similar articles

1
Machine Learning in Environmental Research: Common Pitfalls and Best Practices.
Environ Sci Technol. 2023 Nov 21;57(46):17671-17689. doi: 10.1021/acs.est.3c00026. Epub 2023 Jun 29.
2
Key concepts, common pitfalls, and best practices in artificial intelligence and machine learning: focus on radiomics.
Diagn Interv Radiol. 2022 Sep;28(5):450-462. doi: 10.5152/dir.2022.211297.
3
The future of Cochrane Neonatal.
Early Hum Dev. 2020 Nov;150:105191. doi: 10.1016/j.earlhumdev.2020.105191. Epub 2020 Sep 12.
4
EAACI guidelines on environmental science in allergic diseases and asthma - Leveraging artificial intelligence and machine learning to develop a causality model in exposomics.
Allergy. 2023 Jul;78(7):1742-1757. doi: 10.1111/all.15667. Epub 2023 Feb 15.
5
Machine Learning: New Ideas and Tools in Environmental Science and Engineering.
Environ Sci Technol. 2021 Oct 5;55(19):12741-12754. doi: 10.1021/acs.est.1c01339. Epub 2021 Aug 17.
6
Data-driven modeling and prediction of blood glucose dynamics: Machine learning applications in type 1 diabetes.
Artif Intell Med. 2019 Jul;98:109-134. doi: 10.1016/j.artmed.2019.07.007. Epub 2019 Jul 26.
7
How to read and review papers on machine learning and artificial intelligence in radiology: a survival guide to key methodological concepts.
Eur Radiol. 2021 Apr;31(4):1819-1830. doi: 10.1007/s00330-020-07324-4. Epub 2020 Oct 1.
8
Foundations of Machine Learning-Based Clinical Prediction Modeling: Part I-Introduction and General Principles.
Acta Neurochir Suppl. 2022;134:7-13. doi: 10.1007/978-3-030-85292-4_2.
9
Measuring the Usability and Quality of Explanations of a Machine Learning Web-Based Tool for Oral Tongue Cancer Prognostication.
Int J Environ Res Public Health. 2022 Jul 8;19(14):8366. doi: 10.3390/ijerph19148366.
10
Confound-leakage: confound removal in machine learning leads to leakage.
Gigascience. 2022 Dec 28;12. doi: 10.1093/gigascience/giad071. Epub 2023 Sep 30.

Cited by

1
Real-Time Cell Gap Estimation in LC-Filled Devices Using Lightweight Neural Networks for Edge Deployment.
Nanomaterials (Basel). 2025 Aug 21;15(16):1289. doi: 10.3390/nano15161289.
2
Dataset on wastewater quality monitoring with adsorption and reflectance spectrometry in the UV-vis range.
Sci Data. 2025 Jul 25;12(1):1296. doi: 10.1038/s41597-025-05459-x.
3
A methodology for designing accurate, modifiable and reproducible scientific graphics in environmental studies using GPT4Designer.
Sci Rep. 2025 Jul 1;15(1):21643. doi: 10.1038/s41598-025-00300-2.
4
Hydraulic Connectivity and Hydrochemistry Influence Microbial Community Structure in Agriculturally Affected Alluvial Aquifers in the Midwestern United States.
Environ Sci Technol. 2025 Jun 24;59(24):12279-12291. doi: 10.1021/acs.est.5c03155. Epub 2025 Jun 12.
5
Cadmium accumulation in wheat grain: Accumulation models and soil thresholds for safe production.
Eco Environ Health. 2025 May 14;4(2):100154. doi: 10.1016/j.eehl.2025.100154. eCollection 2025 Jun.
6
Evaluation and Source Analysis of Plant Heavy Metal Pollution in Kalamaili Mountain Nature Reserve.
Plants (Basel). 2025 May 19;14(10):1521. doi: 10.3390/plants14101521.
7
Prediction and validation of nanowire proteins in G20 using machine learning and feature engineering.
Comput Struct Biotechnol J. 2025 Apr 19;27:1706-1718. doi: 10.1016/j.csbj.2025.04.022. eCollection 2025.
8
Enhancing Differentiation of Oxygenated Organic Aerosol: A Machine Learning Approach to Distinguish Local and Transboundary Pollution.
ACS EST Air. 2025 Apr 15;2(5):891-902. doi: 10.1021/acsestair.4c00331. eCollection 2025 May 9.
9
MLinvitroTox reloaded for high-throughput hazard-based prioritization of high-resolution mass spectrometry data.
J Cheminform. 2025 Jan 31;17(1):14. doi: 10.1186/s13321-025-00950-4.
10
A probabilistic deep learning approach to enhance the prediction of wastewater treatment plant effluent quality under shocking load events.
Water Res X. 2024 Dec 3;26:100291. doi: 10.1016/j.wroa.2024.100291. eCollection 2025 Jan 1.