

Addressing Imbalanced Classification Problems in Drug Discovery and Development Using Random Forest, Support Vector Machine, AutoGluon-Tabular, and H2O AutoML.

Author information

Garg Ayush, Ramamurthi Narayanan, Das Shyam Sundar

Affiliations

TCS Research (Life Sciences Division), Tata Consultancy Services Limited, Noida 201303, India.

TCS Research (Life Sciences Division), Tata Consultancy Services Limited, Chennai 600113, India.

Publication information

J Chem Inf Model. 2025 Apr 28;65(8):3976-3989. doi: 10.1021/acs.jcim.5c00023. Epub 2025 Apr 15.

Abstract

Classification models built on class-imbalanced data sets tend to prioritize the accuracy of the majority class, so the minority class generally has a higher misclassification rate. Techniques for addressing class imbalance in classification models can be categorized as data-level, algorithm-level, and hybrid methods. To the best of our knowledge, however, an in-depth analysis of how these techniques perform as a function of the class ratio is not available in the literature. We have addressed this gap and performed a detailed analysis of four techniques for handling imbalanced class distributions using machine learning (ML) methods and AutoML tools. The four techniques are (a) threshold optimization using (i) GHOST and (ii) the area under the precision-recall curve (AUPR), (b) the internal balancing method of the AutoML tools and the class-weight option of the ML methods, and (c) data balancing using SMOTETomek. We generated 27 data sets covering nine class ratios (i.e., the ratio of positive-class samples to total samples) from three data sets in the drug discovery and development field. We employed random forest (RF) and support vector machine (SVM) as representative ML classifiers and AutoGluon-Tabular (version 0.6.1) and H2O AutoML (version 3.40.0.4) as representative AutoML tools.
The important findings of our study are as follows: (i) threshold optimization has no effect on ranking metrics such as AUC and AUPR, but AUC and AUPR are affected by class weighting and SMOTETomek; (ii) for the ML methods RF and SVM, percentage improvements of up to 375, 33.33, and 450 over all the data sets can be achieved for F1 score, MCC, and balanced accuracy, respectively, which are metrics suitable for evaluating performance on imbalanced data sets; (iii) for the AutoML libraries AutoGluon-Tabular and H2O AutoML, percentage improvements of up to 383.33, 37.25, and 533.33 over all the data sets can be achieved for F1 score, MCC, and balanced accuracy, respectively; (iv) the percentage improvement in balanced accuracy generally increases as the class ratio is systematically decreased from 0.5 to 0.1, whereas for F1 score and MCC the maximum improvement is achieved at a class ratio of 0.3; (v) for both ML and AutoML with balancing, no individual class-balancing technique outperforms all other methods on a significantly higher number of data sets based on F1 score; (vi) the three external balancing techniques combined outperformed the internal balancing methods of the ML methods and AutoML tools; (vii) when applied with imbalance-handling techniques, AutoML tools perform as well as the ML models on imbalanced classification and in some cases even better. In summary, we recommend exploring multiple data balancing techniques when classifying imbalanced data sets to achieve optimal performance, as neither the external nor the internal techniques significantly outperform the others. The results are specific to the ML methods and AutoML libraries used in this study; generalizing them would require a study covering a sizable number of ML methods and AutoML libraries.
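As a rough illustration of why threshold optimization changes F1 score, MCC, and balanced accuracy but leaves a ranking metric such as AUC untouched, the following standard-library Python sketch tunes a decision threshold on synthetic scores. Everything in it is an assumption made for illustration: the scores are simulated rather than taken from the study's data sets, the helper names (`confusion`, `best_f1_threshold`, `pct_improvement`, and so on) are hypothetical, and the F1-maximizing search is a simple stand-in that is neither GHOST nor necessarily the precision-recall procedure the authors used.

```python
import math
import random


def confusion(y_true, scores, threshold):
    """Return (tp, fp, tn, fn), predicting positive when score >= threshold."""
    tp = fp = tn = fn = 0
    for y, s in zip(y_true, scores):
        predicted = 1 if s >= threshold else 0
        if predicted and y:
            tp += 1
        elif predicted and not y:
            fp += 1
        elif not predicted and not y:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def f1_score(tp, fp, tn, fn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def balanced_accuracy(tp, fp, tn, fn):
    tpr = tp / (tp + fn) if tp + fn else 0.0
    tnr = tn / (tn + fp) if tn + fp else 0.0
    return (tpr + tnr) / 2


def mcc(tp, fp, tn, fn):
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.
    It takes no threshold argument, so threshold tuning cannot change it."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def best_f1_threshold(y_true, scores):
    """Try every observed score as a threshold and keep the one with best F1."""
    return max(sorted(set(scores)),
               key=lambda t: f1_score(*confusion(y_true, scores, t)))


def pct_improvement(baseline, improved):
    """Percentage improvement over a baseline; undefined for a zero baseline."""
    if baseline == 0:
        return None
    return 100.0 * (improved - baseline) / baseline


def clip01(x):
    return min(1.0, max(0.0, x))


# Synthetic scores at class ratio 0.1 (assumed distributions, not study data).
rng = random.Random(0)
n_total, class_ratio = 1000, 0.1
n_pos = int(n_total * class_ratio)
y_true = [1] * n_pos + [0] * (n_total - n_pos)
scores = ([clip01(rng.gauss(0.35, 0.12)) for _ in range(n_pos)]
          + [clip01(rng.gauss(0.15, 0.10)) for _ in range(n_total - n_pos)])

default_counts = confusion(y_true, scores, 0.5)
best_t = best_f1_threshold(y_true, scores)
tuned_counts = confusion(y_true, scores, best_t)

f1_default, f1_opt = f1_score(*default_counts), f1_score(*tuned_counts)
auc = roc_auc(y_true, scores)

print("AUC (threshold-independent):", round(auc, 3))
print("chosen threshold:", round(best_t, 3))
print("F1 default vs tuned:", round(f1_default, 3), round(f1_opt, 3))
print("balanced accuracy default vs tuned:",
      round(balanced_accuracy(*default_counts), 3),
      round(balanced_accuracy(*tuned_counts), 3))
print("MCC default vs tuned:",
      round(mcc(*default_counts), 3), round(mcc(*tuned_counts), 3))
print("F1 improvement (%):", pct_improvement(f1_default, f1_opt))
```

For brevity the sketch picks the threshold on the same samples it evaluates, which gives an optimistic estimate; in practice the threshold would be chosen on training or validation data and then applied to a held-out test set.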

