Deep Learning-Based Imbalanced Data Classification for Drug Discovery.

Affiliation

Trakya University Faculty of Medicine, Department of Biostatistics and Medical Informatics, Edirne, Turkey.

Publication information

J Chem Inf Model. 2020 Sep 28;60(9):4180-4190. doi: 10.1021/acs.jcim.9b01162. Epub 2020 Jul 8.

Abstract

Drug discovery studies have become increasingly expensive and time-consuming. In the early phase of drug discovery, an extensive search is performed to find drug-like compounds, which can then be optimized over time into a marketed drug. One conventional way of detecting active compounds is to perform a high-throughput screening (HTS) experiment. As of July 2019, the PubChem repository contained 1.3 million bioassays generated through HTS experiments. This makes PubChem a valuable resource for applying machine learning algorithms to develop classification models that detect active compounds for drug discovery studies. However, data sets obtained from PubChem are highly imbalanced, and this imbalance has a negative impact on the classification performance of machine learning algorithms. Here, we explored the classification performance of deep neural networks (DNN) on imbalanced compound data sets after applying various data balancing methods. We used five confirmatory HTS bioassays from the PubChem repository and applied one undersampling method and three oversampling methods to balance the data. We used a fully connected DNN model with two hidden layers to classify active and inactive molecules. To evaluate the performance of the network, we calculated six performance metrics: balanced accuracy, precision, recall, F1 score, Matthews correlation coefficient, and area under the ROC curve. The results showed that the effect of imbalanced data on network performance could be mitigated to a degree by applying the data balancing methods. The level of imbalance, however, has a negative effect on the performance of the network.
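The six metrics named in the abstract can be illustrated with a minimal, standard-library Python sketch. This is not the authors' code: the function names, the 0.5 decision threshold, and the toy labels and scores below are all assumptions chosen for illustration, and the metrics follow their textbook definitions.

```python
import math

def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels, where 1 = active and 0 = inactive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn

def safe_div(num, den):
    """Divide, returning 0.0 when the denominator is zero (a common convention)."""
    return num / den if den else 0.0

def roc_auc(y_true, scores):
    """AUC as the probability that a random active outscores a random inactive.

    Ties count as half a win. This pairwise form is O(n_pos * n_neg), which is
    fine for a small illustration but slow for full HTS-sized data sets.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one active and one inactive sample")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def evaluate(y_true, scores, threshold=0.5):
    """Compute the six metrics; the 0.5 threshold is an assumption, not from the paper."""
    y_pred = [1 if s >= threshold else 0 for s in scores]
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    recall = safe_div(tp, tp + fn)          # sensitivity on the active class
    specificity = safe_div(tn, tn + fp)     # recall on the inactive class
    precision = safe_div(tp, tp + fp)
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "balanced_accuracy": (recall + specificity) / 2,
        "precision": precision,
        "recall": recall,
        "f1": safe_div(2 * precision * recall, precision + recall),
        "mcc": safe_div(tp * tn - fp * fn, mcc_den),
        "roc_auc": roc_auc(y_true, scores),
    }

# Hypothetical toy data: 2 actives among 10 compounds (a 1:4 imbalance).
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.4, 0.6, 0.3, 0.2, 0.2, 0.1, 0.1, 0.1, 0.1]

report = evaluate(y_true, scores)
for name, value in report.items():
    print(f"{name}: {value:.4f}")
```

On this toy example, plain accuracy would be 8/10 even though only one of the two actives is recovered, which is why imbalance-aware metrics such as balanced accuracy and the Matthews correlation coefficient are the ones reported.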
