Addressing Imbalanced Classification Problems in Drug Discovery and Development Using Random Forest, Support Vector Machine, AutoGluon-Tabular, and H2O AutoML.

Author information

Garg Ayush, Ramamurthi Narayanan, Das Shyam Sundar

Affiliations

TCS Research (Life Sciences Division), Tata Consultancy Services Limited, Noida 201303, India.

TCS Research (Life Sciences Division), Tata Consultancy Services Limited, Chennai 600113, India.

Publication information

J Chem Inf Model. 2025 Apr 28;65(8):3976-3989. doi: 10.1021/acs.jcim.5c00023. Epub 2025 Apr 15.

DOI:10.1021/acs.jcim.5c00023

PMID:40230275

Abstract

Classification models built on class-imbalanced data sets tend to prioritize the accuracy of the majority class, so the minority class generally has a higher misclassification rate. Techniques for addressing class imbalance in classification models can be categorized as data-level, algorithm-level, and hybrid methods. To the best of our knowledge, however, an in-depth analysis of how these techniques perform as a function of class ratio is not available in the literature. We address this gap with a detailed analysis of four techniques for handling imbalanced class distributions, applied with machine learning (ML) methods and AutoML tools. The four techniques are (a) threshold optimization using (i) GHOST and (ii) the area under the precision-recall curve (AUPR), (b) the internal balancing method of AutoML tools and the class-weight option of ML methods, and (c) data balancing using SMOTETomek. We generated 27 data sets by considering nine different class ratios (i.e., the ratio of positive-class samples to total samples) for three data sets from the drug discovery and development field. We employed random forest (RF) and support vector machine (SVM) as representative ML classifiers and AutoGluon-Tabular (version 0.6.1) and H2O AutoML (version 3.40.0.4) as representative AutoML tools.
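To illustrate the threshold-optimization idea in (a), here is a minimal sketch that scans candidate decision thresholds on a validation set and keeps the one with the highest F1 score. It is not the GHOST procedure, and the abstract does not say how the threshold is derived from the precision-recall curve, so choosing the F1-maximizing point is an assumption made here for illustration. The function names and the toy data are hypothetical.

```python
from typing import Sequence, Tuple


def f1_at_threshold(y_true: Sequence[int], scores: Sequence[float], threshold: float) -> float:
    """F1 of the positive class when predicting 1 for score >= threshold."""
    tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
    fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
    fn = sum(1 for y, s in zip(y_true, scores) if s < threshold and y == 1)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_f1_threshold(y_true: Sequence[int], scores: Sequence[float]) -> Tuple[float, float]:
    """Try every distinct score as a threshold; return (threshold, F1).

    Ties in F1 are broken in favour of the higher threshold.
    """
    candidates = sorted(set(scores))
    best = max(candidates, key=lambda t: (f1_at_threshold(y_true, scores, t), t))
    return best, f1_at_threshold(y_true, scores, best)


# Hypothetical validation set with class ratio 0.2 (2 positives out of 10).
y_val = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
p_val = [0.05, 0.10, 0.12, 0.15, 0.20, 0.22, 0.30, 0.41, 0.35, 0.45]

default_f1 = f1_at_threshold(y_val, p_val, 0.5)   # conventional 0.5 cutoff
threshold, tuned_f1 = best_f1_threshold(y_val, p_val)
print(f"F1 at 0.50: {default_f1:.2f}; F1 at {threshold:.2f}: {tuned_f1:.2f}")
```

In practice the threshold would be chosen on held-out or out-of-bag predictions rather than on the test set, so that the reported metrics are not optimistically biased.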
The important findings of our study are as follows: (i) threshold optimization has no effect on ranking metrics such as AUC and AUPR, but AUC and AUPR are affected by class weighting and SMOTETomek; (ii) for the ML methods RF and SVM, significant percentage improvements of up to 375, 33.33, and 450 over all the data sets can be achieved for F1 score, MCC, and balanced accuracy, respectively, which are suitable metrics for evaluating performance on imbalanced data sets; (iii) for the AutoML libraries AutoGluon-Tabular and H2O AutoML, significant percentage improvements of up to 383.33, 37.25, and 533.33 over all the data sets can be achieved for F1 score, MCC, and balanced accuracy, respectively; (iv) the percentage improvement in balanced accuracy generally increases as the class ratio is systematically decreased from 0.5 to 0.1, whereas for F1 score and MCC the maximum improvement is achieved at a class ratio of 0.3; (v) for both ML and AutoML with balancing, no individual class-balancing technique outperforms all other methods on a significantly higher number of data sets based on F1 score; (vi) the three external balancing techniques combined outperformed the internal balancing methods of the ML and AutoML tools; (vii) when applied with imbalance-handling techniques, AutoML tools perform as well as the ML models on imbalanced classification and in some cases perform even better. In summary, exploring multiple data balancing techniques is recommended when classifying imbalanced data sets to achieve optimal performance, because neither the external nor the internal techniques significantly outperform the others. The results are specific to the ML methods and AutoML libraries used in this study; generalizing them would require a study covering a sizable number of ML methods and AutoML libraries.
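Finding (i) follows from how the metrics are defined: AUC depends only on how the model ranks samples, while MCC and balanced accuracy depend on the decision threshold. The sketch below illustrates this on hypothetical toy scores. The `pct_improvement` helper assumes that "percentage improvement" means relative change over the unbalanced baseline; the abstract does not define the term, so treat that formula as an assumption. All function names are illustrative.

```python
import math


def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def confusion(y_true, scores, threshold):
    """Return (tp, tn, fp, fn) when predicting 1 for score >= threshold."""
    pred = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for y, p in zip(y_true, pred) if y == 1 and p == 1)
    tn = sum(1 for y, p in zip(y_true, pred) if y == 0 and p == 0)
    fp = sum(1 for y, p in zip(y_true, pred) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(y_true, pred) if y == 1 and p == 0)
    return tp, tn, fp, fn


def balanced_accuracy(tp, tn, fp, fn):
    """Mean of sensitivity and specificity (both classes must be present)."""
    return 0.5 * (tp / (tp + fn) + tn / (tn + fp))


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; defined as 0 when the denominator is 0."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def pct_improvement(baseline, improved):
    """Assumed definition: relative change in percent (undefined for baseline == 0)."""
    return 100.0 * (improved - baseline) / baseline


# Hypothetical scores from one fixed model; class ratio 0.2.
y = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
p = [0.05, 0.10, 0.12, 0.15, 0.20, 0.22, 0.30, 0.41, 0.35, 0.45]

auc = roc_auc(y, p)  # computed from the scores alone, no threshold involved
for t in (0.5, 0.35):
    cm = confusion(y, p, t)
    print(f"threshold={t:.2f} AUC={auc:.4f} "
          f"BA={balanced_accuracy(*cm):.4f} MCC={mcc(*cm):.4f}")

ba_gain = pct_improvement(balanced_accuracy(*confusion(y, p, 0.5)),
                          balanced_accuracy(*confusion(y, p, 0.35)))
print(f"balanced-accuracy improvement: {ba_gain:.1f}%")
```

Class weighting and SMOTETomek, by contrast, change the fitted model and therefore its scores, which is why they can move AUC and AUPR while threshold optimization cannot.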


Similar articles

Data Augmentation and Machine Learning algorithms for multi-class imbalanced morphometrics data of stingless bees.

Heliyon. 2025 Jan 23;11(3):e42214. doi: 10.1016/j.heliyon.2025.e42214. eCollection 2025 Feb 15.

Joint modeling strategy for using electronic medical records data to build machine learning models: an example of intracerebral hemorrhage.

BMC Med Inform Decis Mak. 2022 Oct 25;22(1):278. doi: 10.1186/s12911-022-02018-x.

Stroke Prediction with Machine Learning Methods among Older Chinese.

Int J Environ Res Public Health. 2020 Mar 12;17(6):1828. doi: 10.3390/ijerph17061828.

GHOST: Adjusting the Decision Threshold to Handle Imbalanced Data in Machine Learning.

J Chem Inf Model. 2021 Jun 28;61(6):2623-2640. doi: 10.1021/acs.jcim.1c00160. Epub 2021 Jun 8.

Improving Surgical Site Infection Prediction Using Machine Learning: Addressing Challenges of Highly Imbalanced Data.

Diagnostics (Basel). 2025 Feb 19;15(4):501. doi: 10.3390/diagnostics15040501.

Interaction effect between data discretization and data resampling for class-imbalanced medical datasets.

Technol Health Care. 2025 Mar;33(2):1000-1013. doi: 10.1177/09287329241295874. Epub 2024 Nov 25.

Employing Automated Machine Learning (AutoML) Methods to Facilitate the ADMET Properties Prediction.

J Chem Inf Model. 2025 Apr 14;65(7):3215-3225. doi: 10.1021/acs.jcim.4c02122. Epub 2025 Mar 14.

Prediction and feature selection of low birth weight using machine learning algorithms.

J Health Popul Nutr. 2024 Oct 12;43(1):157. doi: 10.1186/s41043-024-00647-8.

Deep Learning-Based Imbalanced Data Classification for Drug Discovery.

J Chem Inf Model. 2020 Sep 28;60(9):4180-4190. doi: 10.1021/acs.jcim.9b01162. Epub 2020 Jul 8.
