Tate Tia, Patlewicz Grace, Shah Imran
Center for Computational Toxicology and Exposure, Office of Research and Development, U.S. Environmental Protection Agency, Research Triangle Park, North Carolina 27709, USA.
Comput Toxicol. 2024 Mar;29:1-14. doi: 10.1016/j.comtox.2024.100301.
Animal toxicity testing is time- and resource-intensive, making it difficult to keep pace with the number of substances requiring assessment. Machine learning (ML) models that use chemical structure information and high-throughput experimental data can help predict potential toxicity. However, much of the toxicity data used to train ML models is biased, with an unequal balance of positives and negatives, primarily because substances selected for in vivo testing are expected to elicit some toxic effect. To investigate the impact of this bias on predictive performance, various sampling approaches were used to balance in vivo toxicity data as part of a supervised ML workflow to predict hepatotoxicity outcomes from chemical structure and/or targeted transcriptomic data. From the chronic, subchronic, developmental, multigenerational reproductive, and subacute repeat-dose toxicity outcomes with a minimum of 50 positive and 50 negative substances, 18 different study-toxicity outcome combinations were evaluated in up to 7 ML models. These included Artificial Neural Network, Random Forest, Bernoulli Naïve Bayes, Gradient Boosting, and Support Vector classification algorithms, which were compared with a local approach, Generalised Read-Across (GenRA), a similarity-weighted k-Nearest Neighbour (k-NN) method. The mean CV F1 performance for unbalanced data across all classifiers and descriptors for chronic liver effects was 0.735 (0.0395 SD). Mean CV F1 performance dropped to 0.639 (0.073 SD) with over-sampling approaches, though the poorer performance of k-NN approaches in some cases contributed to the observed decrease (mean CV F1 excluding k-NN was 0.697 (0.072 SD)). With under-sampling approaches, the mean CV F1 was 0.523 (0.083 SD). For developmental liver effects, mean CV F1 performance was much lower: 0.089 (0.111 SD) for unbalanced approaches and 0.149 (0.084 SD) for under-sampling.
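The similarity-weighted k-NN idea behind the GenRA comparison can be illustrated with a minimal sketch. This is not the authors' implementation: the fingerprints are random toy data and the function name is hypothetical; it only shows how Jaccard similarity to the k nearest analogues can weight a binary toxicity vote.

```python
import numpy as np

def genra_like_predict(query_fp, train_fps, train_labels, k=5):
    """Similarity-weighted k-NN on binary fingerprints (illustrative sketch only)."""
    # Jaccard similarity between the query and every training chemical
    inter = np.minimum(query_fp, train_fps).sum(axis=1)
    union = np.maximum(query_fp, train_fps).sum(axis=1)
    sims = np.where(union > 0, inter / np.maximum(union, 1), 0.0)
    top = np.argsort(sims)[::-1][:k]        # indices of the k most similar analogues
    w = sims[top]
    if w.sum() == 0:
        return 0.5                          # no informative analogues: undecided
    # similarity-weighted vote over analogue labels, returned as a score in [0, 1]
    return float(np.dot(w, train_labels[top]) / w.sum())

# toy example: 8-bit structural fingerprints, label 1 = hepatotoxic
rng = np.random.default_rng(0)
train_fps = rng.integers(0, 2, size=(20, 8))
train_labels = rng.integers(0, 2, size=20)
query = rng.integers(0, 2, size=8)
score = genra_like_predict(query, train_fps, train_labels, k=5)
```

A score above 0.5 would be read as a positive (toxic) call; thresholds other than 0.5 can be chosen to trade precision against recall.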
Over-sampling approaches increased mean CV F1 performance for developmental liver toxicity to 0.234 (0.107 SD). Model performance was found to depend on dataset, model type, balancing approach, and feature selection. Accordingly, ML workflows tailored to predicting toxicity should account for class imbalance and rely on simpler classifiers first.
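The core comparison in the abstract, cross-validated F1 on unbalanced versus over-sampled training data, can be sketched as follows. This is a minimal illustration on synthetic data, not the study's workflow: the dataset, classifier settings, and the simple duplicate-the-minority over-sampler are all stand-ins. Note that balancing is applied inside each fold, to the training split only, so the held-out fold keeps the original class ratio.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold

def random_oversample(X, y, rng):
    """Duplicate minority-class rows until both classes are the same size."""
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]
    extra = rng.choice(np.where(y == minority)[0],
                       size=counts.max() - counts.min(), replace=True)
    return np.vstack([X, X[extra]]), np.concatenate([y, y[extra]])

# synthetic imbalanced data standing in for an in vivo hepatotoxicity endpoint
X, y = make_classification(n_samples=400, n_features=20,
                           weights=[0.85, 0.15], random_state=0)
rng = np.random.default_rng(0)
f1_raw, f1_bal = [], []
for tr, te in StratifiedKFold(n_splits=5, shuffle=True, random_state=0).split(X, y):
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    # unbalanced training fold
    f1_raw.append(f1_score(y[te], clf.fit(X[tr], y[tr]).predict(X[te])))
    # over-sample the training fold only, then evaluate on the untouched test fold
    Xb, yb = random_oversample(X[tr], y[tr], rng)
    f1_bal.append(f1_score(y[te], clf.fit(Xb, yb).predict(X[te])))

print(f"mean CV F1 unbalanced: {np.mean(f1_raw):.3f}, "
      f"over-sampled: {np.mean(f1_bal):.3f}")
```

Over-sampling before splitting, rather than within each fold as here, would leak duplicated minority rows into the test folds and inflate the reported F1.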