Direct Prediction of Bioaccumulation of Organic Contaminants in Plant Roots from Soils with Machine Learning Models Based on Molecular Structures.

Affiliations

Department of Genetics, School of Medicine, Yale University, New Haven, Connecticut 06510, United States.

Department of Environmental Health Sciences, Mailman School of Public Health, Columbia University, New York, New York 10032, United States.

Publication information

Environ Sci Technol. 2021 Dec 21;55(24):16358-16368. doi: 10.1021/acs.est.1c02376. Epub 2021 Dec 3.

Abstract

Root concentration factor (RCF) is an important characterization parameter to describe accumulation of organic contaminants in plants from soils in life cycle impact assessment (LCIA) and phytoremediation potential assessment. However, building robust predictive models remains challenging due to the complex interactions among chemical-soil-plant root systems. Here we developed end-to-end machine learning models to devolve the complex molecular structure relationship with RCF by training on a unified RCF data set with 341 data points covering 72 chemicals. We demonstrate the efficacy of the proposed gradient boosting regression tree (GBRT) model based on the extended connectivity fingerprints (ECFP) by predicting RCF values and achieved prediction performance with R-squared of 0.77 and mean absolute error (MAE) of 0.22 using 5-fold cross validation. In addition, our results reveal nonlinear relationships among properties of chemical, soil, and plant. Further in-depth analyses identify the key chemical topological substructures (e.g., -O, -Cl, aromatic rings and large conjugated π systems) related to RCF. Stemming from its simplicity and universality, the GBRT-ECFP model provides a valuable tool for LCIA and other environmental assessments to better characterize chemical risks to human health and ecosystems.
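
The evaluation setup described in the abstract (a boosted-tree regressor on binary fingerprint bits, scored by R-squared and MAE under 5-fold cross-validation) can be illustrated with a minimal, dependency-free Python sketch. This is not the authors' implementation. The data are synthetic random bit vectors standing in for ECFP, the target is a made-up additive function rather than measured RCF values, the learner is a simplified gradient booster built from depth-1 trees, and every function name is hypothetical. A real workflow would more likely compute fingerprints with a cheminformatics toolkit and use a library regressor.

```python
import random


def fit_gbrt(X, y, n_rounds=100, lr=0.2):
    """Fit boosted depth-1 trees (stumps) on binary features with squared loss."""
    base = sum(y) / len(y)                      # initial prediction: the mean
    pred = [base] * len(y)
    stumps = []
    for _ in range(n_rounds):
        # For squared loss the negative gradient is just the residual.
        resid = [yi - pi for yi, pi in zip(y, pred)]
        best = None
        for j in range(len(X[0])):
            on = [r for row, r in zip(X, resid) if row[j]]
            off = [r for row, r in zip(X, resid) if not row[j]]
            if not on or not off:
                continue                        # bit is constant, no split possible
            m_on, m_off = sum(on) / len(on), sum(off) / len(off)
            sse = (sum((r - m_on) ** 2 for r in on)
                   + sum((r - m_off) ** 2 for r in off))
            if best is None or sse < best[0]:
                best = (sse, j, m_on, m_off)
        if best is None:
            break
        _, j, m_on, m_off = best
        stumps.append((j, lr * m_on, lr * m_off))   # shrink each stump by lr
        pred = [p + (lr * m_on if row[j] else lr * m_off)
                for p, row in zip(pred, X)]
    return base, stumps


def predict(model, X):
    base, stumps = model
    return [base + sum(a if row[j] else b for j, a, b in stumps) for row in X]


def r2_score(y_true, y_pred):
    mean = sum(y_true) / len(y_true)
    ss_res = sum((a - b) ** 2 for a, b in zip(y_true, y_pred))
    ss_tot = sum((a - mean) ** 2 for a in y_true)
    return 1 - ss_res / ss_tot


def mae(y_true, y_pred):
    return sum(abs(a - b) for a, b in zip(y_true, y_pred)) / len(y_true)


def k_fold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1 and deal them into k disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cross_validate(X, y, k=5):
    """Pool held-out predictions from all k folds, then score them once."""
    y_true, y_pred = [], []
    for fold in k_fold_indices(len(y), k):
        held = set(fold)
        train = [i for i in range(len(y)) if i not in held]
        model = fit_gbrt([X[i] for i in train], [y[i] for i in train])
        y_true += [y[i] for i in fold]
        y_pred += predict(model, [X[i] for i in fold])
    return r2_score(y_true, y_pred), mae(y_true, y_pred)


# Synthetic stand-in data (assumption): 200 "molecules", 16 fingerprint bits,
# target driven by three bits plus small Gaussian noise.
rng = random.Random(42)
X = [[rng.randint(0, 1) for _ in range(16)] for _ in range(200)]
y = [1.0 * row[0] + 0.5 * row[3] - 0.7 * row[7] + rng.gauss(0, 0.05) for row in X]

r2, err = cross_validate(X, y, k=5)
print(f"5-fold CV: R2 = {r2:.3f}, MAE = {err:.3f}")
```

Because the synthetic target is nearly additive in a few bits, this toy model scores far better than would be expected on real RCF measurements; the sketch only shows how the two reported metrics are computed from held-out predictions.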
