Gao Feng, Zhang Wei, Baccarelli Andrea A, Shen Yike
Department of Environmental Health Sciences, Mailman School of Public Health, Columbia University, New York, NY 10032, United States.
Department of Plant, Soil and Microbial Sciences, Michigan State University, East Lansing, MI 48823, United States.
Environ Int. 2022 May;163:107224. doi: 10.1016/j.envint.2022.107224. Epub 2022 Apr 1.
In silico prediction of chemical ecotoxicity (HC50) is an important complement to in vivo and in vitro toxicological assessment of manufactured chemicals. Recent applications of machine learning models to predict chemical HC50 have yielded variable prediction performance, which depends on how effectively chemical representations are learned from high-dimensional data. To improve HC50 prediction performance, we developed an autoencoder model that learns latent space chemical embeddings. This approach achieved state-of-the-art HC50 prediction performance, with R² of 0.668 ± 0.003 and mean absolute error (MAE) of 0.572 ± 0.001, and outperformed other dimension reduction methods, including principal component analysis (PCA) (R² = 0.601 ± 0.031 and MAE = 0.629 ± 0.005), kernel PCA (R² = 0.631 ± 0.008 and MAE = 0.625 ± 0.006), and uniform manifold approximation and projection (UMAP) (R² = 0.400 ± 0.008 and MAE = 0.801 ± 0.002). A simple linear layer using chemical embeddings learned by the autoencoder model performed better than random forest (R² = 0.663 ± 0.007 and MAE = 0.591 ± 0.008), fully connected neural network (R² = 0.614 ± 0.016 and MAE = 0.610 ± 0.008), least absolute shrinkage and selection operator (LASSO) (R² = 0.617 ± 0.037 and MAE = 0.619 ± 0.007), and ridge regression (R² = 0.638 ± 0.007 and MAE = 0.613 ± 0.005) models using unlearned raw input features. Our results highlight the usefulness of learning latent chemical representations, and our autoencoder model provides an alternative approach for robust HC50 prediction.
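The two metrics reported above can be illustrated with a minimal sketch. It assumes the reported R value is the coefficient of determination (R²), computed as 1 - SS_res / SS_tot, and that MAE is the plain average of absolute errors. The function names and the sample values are hypothetical and are not taken from the study.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between observed and predicted values."""
    errors = [abs(t - p) for t, p in zip(y_true, y_pred)]
    return sum(errors) / len(errors)


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    # Residual sum of squares: error left after prediction.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: spread of the observed values around their mean.
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical values for illustration only (not data from the study).
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 2.0, 2.5, 4.0]

mae = mean_absolute_error(observed, predicted)
r2 = r_squared(observed, predicted)
print(f"MAE = {mae:.3f}, R^2 = {r2:.3f}")
```

A higher R² and a lower MAE both indicate better agreement between predicted and observed values, which is the sense in which the methods above are ranked.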