Department of Statistics, Data Science and Modelling, National Institute of Public Health and the Environment, Bilthoven, The Netherlands.
Center for Safety of Substances and Products, National Institute of Public Health and the Environment, Bilthoven, The Netherlands.
SAR QSAR Environ Res. 2023 Oct-Dec;34(10):765-788. doi: 10.1080/1062936X.2023.2254225. Epub 2023 Sep 6.
Ecotoxicological safety assessment of chemicals requires toxicity data on multiple species, despite the general desire to minimize animal testing. Predictive models, specifically machine learning (ML) methods, are among the tools that can resolve this apparent contradiction, because they allow toxicity patterns to be generalized across chemicals and species. However, although large public toxicity datasets are available, the data are highly sparse, which complicates model development. The aim of this study is to provide insight into how ML can predict toxicity from a large but sparse dataset. We developed models to predict LC50 values, based on experimental LC50 data covering 2431 organic chemicals and 1506 aquatic species from the ECOTOX database. Several well-known ML techniques were evaluated, and a new ML model inspired by recommender systems was developed. This new model is a simple linear model that uses factorization machines to learn low-rank interactions between species and chemicals. We evaluated the predictive performance of the developed models in two validation settings: 1) predicting unseen chemical-species pairs, and 2) predicting unseen chemicals. The results show that ML models can accurately predict LC50 values in both validation settings. Moreover, we show that the novel factorization machine approach can match the performance of well-tuned, complex ML approaches.
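To make the factorization machine idea concrete, the following is a minimal sketch of a second-order factorization machine prediction for a single species-chemical pair. It is not the authors' implementation: the one-hot encoding of species and chemical, the toy dimensions, the function names (`fm_predict`, `fm_predict_naive`, `encode`) and the random, untrained weights are all assumptions made for illustration, and the published model may use additional features such as chemical descriptors.

```python
import numpy as np

def fm_predict(x, w0, w, V):
    """Second-order factorization machine prediction for one feature vector x.

    w0: global bias, w: linear weights (n,), V: latent factors (n, k).
    """
    linear = w0 + x @ w
    # Pairwise term via the standard O(k*n) identity:
    # sum_{i<j} <V_i, V_j> x_i x_j
    #   = 0.5 * sum_f [ (sum_i V_if x_i)^2 - sum_i V_if^2 x_i^2 ]
    xv = x @ V                  # shape (k,)
    x2v2 = (x ** 2) @ (V ** 2)  # shape (k,)
    return linear + 0.5 * np.sum(xv ** 2 - x2v2)

def fm_predict_naive(x, w0, w, V):
    """Same prediction with an explicit double loop, used only as a check."""
    n = x.shape[0]
    total = w0 + x @ w
    for i in range(n):
        for j in range(i + 1, n):
            total += (V[i] @ V[j]) * x[i] * x[j]
    return total

# Hypothetical toy sizes; the study covers 1506 species and 2431 chemicals.
n_species, n_chemicals, k = 4, 5, 3
n = n_species + n_chemicals

# Random weights stand in for a fitted model (assumption, not trained values).
rng = np.random.default_rng(0)
w0 = 0.1
w = rng.normal(size=n)
V = rng.normal(scale=0.1, size=(n, k))

def encode(species_idx, chemical_idx):
    """One-hot encode a (species, chemical) pair into one feature vector."""
    x = np.zeros(n)
    x[species_idx] = 1.0
    x[n_species + chemical_idx] = 1.0
    return x

x = encode(species_idx=2, chemical_idx=3)
pred = fm_predict(x, w0, w, V)  # stands in for a predicted (log) LC50
```

With this one-hot encoding the prediction reduces to a bias, one species effect, one chemical effect, and the inner product of the species and chemical latent vectors. That inner product is the low-rank interaction: every species and every chemical gets a k-dimensional vector, so an interaction can be estimated even for pairs that were never tested together, which is what makes the approach attractive for sparse data.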