Zhang Yuan, Li Yanting, Li Yang, Zhao Lin, Yang Yongkui
School of Environmental Science and Engineering, Tianjin University, Tianjin 300350, China.
Georgia Tech Shenzhen Institute, Tianjin University, Shenzhen 518071, China.
Toxics. 2025 Jul 10;13(7):579. doi: 10.3390/toxics13070579.
Machine learning (ML) techniques are becoming increasingly valuable for modeling the transport of pollutants in plant systems. However, two challenges remain when using ML to predict pollutant migration in hydroponic systems: small sample sizes and a lack of quantitative calculation functions. For the bioaccumulation of per- and polyfluoroalkyl substances, we studied the key factors and quantitative calculation equations using data augmentation, ML, and symbolic regression. First, feature expansion was performed on the input data after data preprocessing; the most important step was data augmentation. The original training set was expanded nine times by combining the synthetic minority oversampling technique and a variational autoencoder. Subsequently, four ML models were applied to the test set to predict the selected output parameters. Categorical boosting (CatBoost) had the highest prediction accuracy (R² = 0.83). The SHapley Additive exPlanations (SHAP) values indicated that molecular weight and exposure time were the most important parameters. We applied three symbolic regression models to obtain accurate prediction equations based on the original and augmented data. Based on the augmented data, the high-dimensional sparse interaction equation exhibited the highest accuracy (R² = 0.776). Our results indicate that this method could provide crucial insights into absorption and accumulation in plant roots.
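The augmentation step described above can be illustrated with a minimal sketch. It shows only the interpolation idea behind the synthetic minority oversampling technique, not the variational autoencoder and not the authors' actual pipeline. The function name `smote_like_augment`, the choice to interpolate the target together with the features, the toy values, and the reading of "expanded nine times" as nine times the original size are all assumptions made for illustration.

```python
import math
import random


def smote_like_augment(rows, n_new, k=3, seed=0):
    """Create n_new synthetic rows by interpolating between a row and one of
    its k nearest neighbours (the core idea of SMOTE).

    Each row is a list of floats. Here the last entry is treated as the
    target, so features and target are interpolated together; this is an
    assumption for regression data, not something stated in the abstract.
    """
    if len(rows) < 2:
        raise ValueError("need at least two rows to interpolate")
    rng = random.Random(seed)
    k = min(k, len(rows) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(rows))
        base = rows[i]
        # Rank all other rows by Euclidean distance to the base row.
        # (Real data should be scaled first so no column dominates.)
        others = [r for j, r in enumerate(rows) if j != i]
        others.sort(key=lambda r: math.dist(base, r))
        neighbour = rng.choice(others[:k])
        # Pick a random point on the segment between base and neighbour.
        t = rng.random()
        synthetic.append([b + t * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy rows: [molecular_weight, exposure_time, target]; values are made up.
original = [
    [414.0, 1.0, 0.8],
    [500.0, 3.0, 1.4],
    [314.0, 7.0, 0.5],
    [614.0, 5.0, 2.1],
]
# Adding 8x synthetic rows gives a set nine times the original size.
augmented = original + smote_like_augment(original, n_new=8 * len(original))
print(len(original), len(augmented))  # 4 36
```

Because every synthetic row lies on a segment between two real rows, the augmented values stay within the range of the original data. Only the training set should be augmented in this way; the test set is left untouched so that the reported accuracy reflects real measurements.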