Structure-activity relationship-based chemical classification of highly imbalanced Tox21 datasets.

Authors

Idakwo Gabriel, Thangapandian Sundar, Luttrell Joseph, Li Yan, Wang Nan, Zhou Zhaoxian, Hong Huixiao, Yang Bei, Zhang Chaoyang, Gong Ping

Affiliations

School of Computing Sciences and Computer Engineering, University of Southern Mississippi, Hattiesburg, MS, 39406, USA.

Environmental Laboratory, U.S. Army Engineer Research and Development Center, Vicksburg, MS, 39180, USA.

Publication information

J Cheminform. 2020 Oct 27;12(1):66. doi: 10.1186/s13321-020-00468-x.

DOI: 10.1186/s13321-020-00468-x

PMID: 33372637

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC7592558/

Abstract

The specificity of toxicant-target biomolecule interactions lends to the very imbalanced nature of many toxicity datasets, causing poor performance in Structure-Activity Relationship (SAR)-based chemical classification. Undersampling and oversampling are representative techniques for handling such an imbalance challenge. However, removing inactive chemical compound instances from the majority class using an undersampling technique can result in information loss, whereas increasing active toxicant instances in the minority class by interpolation tends to introduce artificial minority instances that often cross into the majority class space, giving rise to class overlapping and a higher false prediction rate. In this study, in order to improve the prediction accuracy of imbalanced learning, we employed SMOTEENN, a combination of Synthetic Minority Over-sampling Technique (SMOTE) and Edited Nearest Neighbor (ENN) algorithms, to oversample the minority class by creating synthetic samples, followed by cleaning the mislabeled instances. We chose the highly imbalanced Tox21 dataset, which consisted of 12 in vitro bioassays for > 10,000 chemicals that were distributed unevenly between binary classes. With Random Forest (RF) as the base classifier and bagging as the ensemble strategy, we applied four hybrid learning methods, i.e., RF without imbalance handling (RF), RF with Random Undersampling (RUS), RF with SMOTE (SMO), and RF with SMOTEENN (SMN). The performance of the four learning methods was compared using nine evaluation metrics, among which F score, Matthews correlation coefficient and Brier score provided a more consistent assessment of the overall performance across the 12 datasets. The Friedman's aligned ranks test and the subsequent Bergmann-Hommel post hoc test showed that SMN significantly outperformed the other three methods. 
We also found that a strong negative correlation existed between the prediction accuracy and the imbalance ratio (IR), which is defined as the number of inactive compounds divided by the number of active compounds. SMN became less effective when IR exceeded a certain threshold (e.g., > 28). The ability to separate the few active compounds from the vast amounts of inactive ones is of great importance in computational toxicology. This work demonstrates that the performance of SAR-based, imbalanced chemical toxicity classification can be significantly improved through the use of data rebalancing.
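
The imbalance ratio and three of the metrics named in the abstract (F score, Matthews correlation coefficient, Brier score) can be illustrated with a minimal sketch. This is not the authors' code: the function names, the label encoding (1 = active, 0 = inactive), and the toy data are assumptions made only for illustration, and the F score is taken here to be the F1 score.

```python
import math

# Assumed encoding for this sketch: 1 = active compound, 0 = inactive compound.

def imbalance_ratio(labels):
    """IR = number of inactive compounds / number of active compounds."""
    n_active = sum(1 for y in labels if y == 1)
    n_inactive = sum(1 for y in labels if y == 0)
    if n_active == 0:
        raise ValueError("IR is undefined without at least one active compound")
    return n_inactive / n_active

def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) with 'active' as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn

def f1_score(y_true, y_pred):
    """Harmonic mean of precision and recall; 0.0 when no actives are involved."""
    tp, _, fp, fn = confusion_counts(y_true, y_pred)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def matthews_corrcoef(y_true, y_pred):
    """MCC in [-1, 1]; 0.0 by convention when a marginal count is zero."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

def brier_score(y_true, prob_active):
    """Mean squared difference between predicted P(active) and the 0/1 label."""
    if not y_true:
        raise ValueError("brier_score needs at least one sample")
    return sum((p - t) ** 2 for t, p in zip(y_true, prob_active)) / len(y_true)

# Toy assay: 2 actives among 60 compounds (hypothetical data).
y_true = [1, 1] + [0] * 58
all_inactive = [0] * 60            # a classifier that never predicts "active"
one_hit = [1, 0, 1] + [0] * 57     # finds one active, raises one false alarm

print("IR:", imbalance_ratio(y_true))
print("F1 (all inactive):", f1_score(y_true, all_inactive))
print("MCC (all inactive):", matthews_corrcoef(y_true, all_inactive))
print("F1 (one hit):", f1_score(y_true, one_hit))
print("MCC (one hit):", matthews_corrcoef(y_true, one_hit))
```

In this toy case the always-inactive classifier is correct on 58 of 60 compounds yet scores zero on both F1 and MCC, which is one reason such metrics are often preferred over plain accuracy on imbalanced data.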

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/7d41/7592558/9607897165e2/13321_2020_468_Fig1_HTML.jpg

Similar articles

Joint modeling strategy for using electronic medical records data to build machine learning models: an example of intracerebral hemorrhage.

BMC Med Inform Decis Mak. 2022 Oct 25;22(1):278. doi: 10.1186/s12911-022-02018-x.

A hybrid sampling algorithm combining M-SMOTE and ENN based on Random forest for medical imbalanced data.

J Biomed Inform. 2020 Jul;107:103465. doi: 10.1016/j.jbi.2020.103465. Epub 2020 Jun 5.

Hybrid Class Balancing Approach for Chemical Compound Toxicity Prediction.

Curr Comput Aided Drug Des. 2024 Sep 24. doi: 10.2174/0115734099315538240909101737.

Binary classification of a large collection of environmental chemicals from estrogen receptor assays by quantitative structure-activity relationship and machine learning methods.

J Chem Inf Model. 2013 Dec 23;53(12):3244-61. doi: 10.1021/ci400527b. Epub 2013 Dec 11.

A hybrid sampling algorithm combining synthetic minority over-sampling technique and edited nearest neighbor for missed abortion diagnosis.

BMC Med Inform Decis Mak. 2022 Dec 29;22(1):344. doi: 10.1186/s12911-022-02075-2.

A hybrid resampling algorithms SMOTE and ENN based deep learning models for identification of Marburg virus inhibitors.

Future Med Chem. 2022 May;14(10):701-715. doi: 10.4155/fmc-2021-0290. Epub 2022 Apr 8.

RSMOTE: improving classification performance over imbalanced medical datasets.

Health Inf Sci Syst. 2020 Jun 12;8(1):22. doi: 10.1007/s13755-020-00112-w. eCollection 2020 Dec.

Identifying Protein Features and Pathways Responsible for Toxicity Using Machine Learning and Tox21: Implications for Predictive Toxicology.

Molecules. 2022 May 8;27(9):3021. doi: 10.3390/molecules27093021.

A hybrid Stacking-SMOTE model for optimizing the prediction of autistic genes.

BMC Bioinformatics. 2023 Oct 6;24(1):379. doi: 10.1186/s12859-023-05501-y.

Cited by

Adjusted imbalance ratio leads to effective AI-based drug discovery against infectious disease.

Sci Rep. 2025 Aug 12;15(1):29563. doi: 10.1038/s41598-025-15265-5.

Interpretable machine learning models for predicting the antitumor effects of metal and metal oxide nanomaterials.

RSC Adv. 2025 Jun 3;15(21):17036-17048. doi: 10.1039/d5ra02309b. eCollection 2025 May 15.

A refined set of RxNorm drug names for enhancing unstructured data analysis in drug safety surveillance.

Exp Biol Med (Maywood). 2025 May 2;250:10374. doi: 10.3389/ebm.2025.10374. eCollection 2025.

Developing muscarinic receptor M1 classification models utilizing transfer learning and generative AI techniques.

Sci Rep. 2025 May 12;15(1):16486. doi: 10.1038/s41598-025-00972-w.

Predicting the Toxicity of Drug Molecules with Selecting Effective Descriptors Using a Binary Ant Colony Optimization (BACO) Feature Selection Approach.

Molecules. 2025 Mar 31;30(7):1548. doi: 10.3390/molecules30071548.

Developing predictive models for µ opioid receptor binding using machine learning and deep learning techniques.

Exp Biol Med (Maywood). 2025 Mar 19;250:10359. doi: 10.3389/ebm.2025.10359. eCollection 2025.

Development of a comprehensive open access "molecules with androgenic activity resource (MAAR)" to facilitate risk assessment of chemicals.

Exp Biol Med (Maywood). 2024 Sep 19;249:10279. doi: 10.3389/ebm.2024.10279. eCollection 2024.

A Chemical Structure and Machine Learning Approach to Assess the Potential Bioactivity of Endogenous Metabolites and Their Association with Early Childhood Systemic Inflammation.

Metabolites. 2024 May 10;14(5):278. doi: 10.3390/metabo14050278.

BERT-based language model for accurate drug adverse event extraction from social media: implementation, evaluation, and contributions to pharmacovigilance practices.

Front Public Health. 2024 Apr 23;12:1392180. doi: 10.3389/fpubh.2024.1392180. eCollection 2024.

A quantum-based oversampling method for classification of highly imbalanced and overlapped data.

Exp Biol Med (Maywood). 2023 Dec;248(24):2500-2513. doi: 10.1177/15353702231220665. Epub 2024 Jan 28.

