Identification of Optimal Machine Learning Algorithms and Molecular Fingerprints for Explainable Toxicity Prediction Models Using ToxCast/Tox21 Bioassay Data.

Authors

Kim Donghyeon, Jeong Jaeseong, Choi Jinhee

Affiliation

School of Environmental Engineering, University of Seoul, 163 Seoulsiripdae-ro, Dongdaemun-gu, Seoul 02504, Republic of Korea.

Publication information

ACS Omega. 2024 Aug 27;9(36):37934-37941. doi: 10.1021/acsomega.4c04474. eCollection 2024 Sep 10.

Abstract

Recent studies have primarily focused on introducing novel frameworks that enhance the predictive power of toxicity prediction models by refining molecular representation methods and algorithms. However, these methods are inherently complex and often difficult to understand and explain, which creates barriers to their regulatory adoption and validation. Therefore, the optimal model must be selected by considering not only performance but also interpretability. This study aimed to identify the optimal combination of molecular fingerprints (pattern-based versus algorithm-based) and machine learning algorithms (simple versus complex) for developing explainable toxicity prediction models through a comprehensive investigation of the ToxCast/Tox21 bioassay data set. For 1092 ToxCast/Tox21 assays, five molecular fingerprints (MACCS, Morgan, RDKit, Layered, and Patterned) and six algorithms (MLP, GBT, Random Forest, NN, Logistic Regression, and Naïve Bayes) were used to train the models. Results showed that 35 models achieved acceptable performance (F1 score or accuracy of 0.8 or higher). Among the combinations, either MACCS or Morgan, paired with Random Forest, demonstrated robust performance compared with other molecular fingerprints and algorithms. MACCS and Random Forest remain valuable even when interpretability is prioritized. Consequently, MACCS-Random Forest models for four assays targeting G protein-coupled receptors and kinases were identified; they can be used to discern specific structural features or patterns in chemical compounds, offering explainable insights into toxicity-related chemical structures. This study indicates the importance of not disregarding simple models when assessing both predictivity and interpretability in chemical feature-based Tox21 data analysis.
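
The screening design and acceptance rule described in the abstract can be sketched in a few lines of Python. This is a minimal illustration, not the authors' pipeline: it assumes every fingerprint is paired with every algorithm for each assay (the abstract does not state the exact design), and the helper names `combinations`, `f1_score`, `accuracy`, and `is_acceptable` are hypothetical. Only the fingerprint and algorithm names, the assay count, and the 0.8 threshold come from the abstract.

```python
from itertools import product

# Names and counts taken from the abstract.
FINGERPRINTS = ["MACCS", "Morgan", "RDKit", "Layered", "Patterned"]
ALGORITHMS = ["MLP", "GBT", "Random Forest", "NN",
              "Logistic Regression", "Naive Bayes"]
N_ASSAYS = 1092
THRESHOLD = 0.8  # "F1 score or accuracy of 0.8 or higher"


def combinations():
    """All fingerprint/algorithm pairs (assumed full grid)."""
    return list(product(FINGERPRINTS, ALGORITHMS))


def accuracy(tp, fp, tn, fn):
    """Fraction of correct predictions; 0.0 if there are no samples."""
    total = tp + fp + tn + fn
    return (tp + tn) / total if total else 0.0


def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall; 0.0 if undefined."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def is_acceptable(tp, fp, tn, fn, threshold=THRESHOLD):
    """Apply the abstract's rule: F1 OR accuracy at or above the threshold."""
    return (f1_score(tp, fp, fn) >= threshold
            or accuracy(tp, fp, tn, fn) >= threshold)


if __name__ == "__main__":
    pairs = combinations()
    print(len(pairs), "fingerprint/algorithm pairs per assay")
    print(len(pairs) * N_ASSAYS, "candidate models under the full-grid assumption")
    # Hypothetical confusion-matrix counts, for illustration only.
    print(is_acceptable(tp=8, fp=2, tn=0, fn=2))
    print(is_acceptable(tp=5, fp=5, tn=5, fn=5))
```

Because the rule is a disjunction, a model can pass on accuracy alone; with the hypothetical counts `tp=1, fp=0, tn=95, fn=4`, accuracy is 0.96 while F1 is about 0.33, so `is_acceptable` still returns `True`.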

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/8287/11391437/84828c647fab/ao4c04474_0001.jpg
