BASF SE, Ludwigshafen am Rhein 67063, Germany.
Department of Pharmaceutical Sciences, Faculty of Life Sciences, University of Vienna, Vienna 1090, Austria.
J Chem Inf Model. 2021 Jul 26;61(7):3255-3272. doi: 10.1021/acs.jcim.1c00451. Epub 2021 Jun 21.
Computational methods such as machine learning approaches have a strong track record of success in predicting the outcomes of in vitro assays. In contrast, their ability to predict in vivo endpoints is more limited due to the high number of parameters and processes that may influence the outcome. Recent studies have shown that the combination of chemical and biological data can yield better models for in vivo endpoints. The ChemBioSim approach presented in this work aims to enhance the performance of conformal prediction models for in vivo endpoints by combining chemical information with (predicted) bioactivity assay outcomes. Three in vivo toxicological endpoints, capturing genotoxic (MNT), hepatic (DILI), and cardiological (DICC) issues, were selected for this study due to their high relevance for the registration and authorization of new compounds. Since the sparsity of available biological assay data is challenging for predictive modeling, predicted bioactivity descriptors were used in place of measured assay outcomes. To this end, a machine learning model was trained for each of the 373 collected biological assays and applied to the compounds of the in vivo toxicity data sets. Besides the chemical descriptors (molecular fingerprints and physicochemical properties), these predicted bioactivities served as descriptors for the models of the three in vivo endpoints. For this study, a workflow based on a conformal prediction framework (a method for confidence estimation) built on random forest models was developed. Furthermore, the most relevant chemical and bioactivity descriptors for each in vivo endpoint were preselected with lasso models. The incorporation of bioactivity descriptors increased the mean F1 score of the MNT model from 0.61 to 0.70 and that of the DICC model from 0.72 to 0.82, while the mean efficiencies increased by roughly 0.10 for both endpoints. In contrast, for the DILI endpoint, no significant improvement in model performance was observed.
Besides pure performance improvements, an analysis of the most important bioactivity features allowed detection of novel and less intuitive relationships between the predicted biological assay outcomes used as descriptors and the in vivo endpoints. This study shows how the prediction of in vivo toxicity endpoints can be improved by incorporating biological information, which is not necessarily captured by chemical descriptors, in an automated workflow. Because predicted outcomes of bioactivity assays were used as descriptors, generating them adds no experimental workload. All bioactivity CP models for deriving the predicted bioactivities, as well as the in vivo toxicity CP models, can be freely downloaded from https://doi.org/10.5281/zenodo.4761225.