Li Siyuan, Shen Yuting, Gao Meng, Song Huatai, Ge Zhanpeng, Zhang Qiuyue, Xu Jiaping, Wang Yu, Sun Hongwen
MOE Key Laboratory of Pollution Processes and Environmental Criteria, College of Environmental Science and Engineering, Nankai University, Tianjin 300350, China.
Toxics. 2024 Oct 12;12(10):737. doi: 10.3390/toxics12100737.
To predict the behavior of aromatic contaminants (ACs) in complex soil-plant systems, this study developed machine learning (ML) models to estimate the root concentration factor (RCF) of both traditional ACs (e.g., polycyclic aromatic hydrocarbons, polychlorinated biphenyls) and emerging ACs (e.g., phthalate acid esters, aryl organophosphate esters). Four ML algorithms were trained on a unified RCF dataset of 878 data points, covering 6 features of soil-plant cultivation systems and 98 molecular descriptors of 55 chemicals, including 29 emerging ACs. The gradient-boosted regression tree (GBRT) model demonstrated strong predictive performance, with a coefficient of determination (R²) of 0.75, a mean absolute error (MAE) of 0.11, and a root mean square error (RMSE) of 0.22, as validated by five-fold cross-validation. Multiple explanatory analyses highlighted the significance of soil organic matter (SOM), plant protein and lipid content, exposure time, and molecular descriptors related to electronegativity distribution pattern (GATS8e) and double-ring structure (fr_bicyclic). An increase in SOM was found to decrease the overall RCF, while the other variables showed strong correlations with RCF only within specific ranges. This GBRT model provides an important tool for assessing the environmental behaviors of ACs in soil-plant systems, thereby supporting further investigations into their ecological and human exposure risks.
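For readers unfamiliar with the reported metrics, the following is a minimal sketch of how R², MAE, and RMSE are conventionally computed and how a five-fold split of 878 samples can be formed. It is not the authors' code: the function names, the shuffling seed, and the toy values are all assumptions made for illustration, and the abstract does not state whether the metrics were pooled across folds or averaged per fold.

```python
import math
import random


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean square error."""
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    return math.sqrt(mse)


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation.

    The seed is an arbitrary illustrative choice, not taken from the study.
    """
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k nearly equal, disjoint folds
    for i in range(k):
        test = folds[i]
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, test


# Toy values (hypothetical, not data from the study).
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.5, 2.0, 2.5, 4.0]
print(r2_score(y_true, y_pred), mae(y_true, y_pred), rmse(y_true, y_pred))

# Each of the 878 data points is held out exactly once across the five folds.
splits = list(k_fold_indices(878, k=5))
print([len(test) for _, test in splits])
```

In practice, a model such as a gradient-boosted regressor would be fitted on each `train` index set and scored on the matching `test` set; the sketch leaves the model out so that it depends only on the standard library.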