Comparing the performance of meta-classifiers: a case study on selected imbalanced data sets relevant for prediction of liver toxicity.

Affiliations

Department of Pharmaceutical Chemistry, University of Vienna, Althanstrasse 14, 1090, Vienna, Austria.

Computational Toxicology Group, CMS, R&D Platform Technology & Science, GSK, Park Road, Ware, Hertfordshire, SG12 0DP, UK.

Publication information

J Comput Aided Mol Des. 2018 May;32(5):583-590. doi: 10.1007/s10822-018-0116-z. Epub 2018 Apr 6.

DOI:10.1007/s10822-018-0116-z

PMID:29626291

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC5919997/

Abstract

Cheminformatics datasets used in classification problems, especially those related to biological or physicochemical properties, are often imbalanced. This presents a major challenge in the development of in silico prediction models, as traditional machine learning algorithms are known to work best on balanced datasets. Class imbalance biases the performance of these algorithms because they favor the majority class. Here, we compare the performance of seven different meta-classifiers in their ability to handle imbalanced datasets, with Random Forest used as the base classifier. Four different datasets that are directly (cholestasis) or indirectly (via inhibition of organic anion transporting polypeptide 1B1 and 1B3) related to liver toxicity were chosen for this purpose. The imbalance ratio (negative to positive class) in these datasets ranges between 4:1 and 20:1. Three different sets of molecular descriptors were used for model development, and their performance was assessed in 10-fold cross-validation and on an independent validation set. Stratified Bagging, MetaCost and CostSensitiveClassifier were found to be the best performing among all the methods. While MetaCost and CostSensitiveClassifier provided better sensitivity values, Stratified Bagging resulted in high balanced accuracies.
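To make the metrics in the abstract concrete, the following is a minimal Python sketch, not the authors' implementation (MetaCost and CostSensitiveClassifier are WEKA meta-classifiers, and no code appears in the abstract). It shows why plain accuracy is misleading at a 20:1 imbalance ratio while sensitivity and balanced accuracy are not, and how one class-balanced bag could be drawn. The function names `summarize` and `stratified_bag`, the toy labels, and the assumption that each bag draws equally many positives and negatives with replacement are illustrative choices.

```python
import random


def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels, where 1 is the positive class."""
    tp = tn = fp = fn = 0
    for t, p in zip(y_true, y_pred):
        if t == 1 and p == 1:
            tp += 1
        elif t == 0 and p == 0:
            tn += 1
        elif t == 0 and p == 1:
            fp += 1
        else:
            fn += 1
    return tp, tn, fp, fn


def summarize(y_true, y_pred):
    """Compute accuracy, sensitivity, specificity and balanced accuracy."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    sens = tp / (tp + fn) if tp + fn else 0.0
    spec = tn / (tn + fp) if tn + fp else 0.0
    return {
        "accuracy": (tp + tn) / len(y_true),
        "sensitivity": sens,
        "specificity": spec,
        "balanced_accuracy": (sens + spec) / 2,
    }


def stratified_bag(y, rng):
    """Draw indices for one class-balanced bag (assumed sampling scheme).

    Both classes are sampled with replacement, each contributing as many
    instances as the smaller class contains.
    """
    pos = [i for i, label in enumerate(y) if label == 1]
    neg = [i for i, label in enumerate(y) if label == 0]
    n = min(len(pos), len(neg))
    return [rng.choice(pos) for _ in range(n)] + [rng.choice(neg) for _ in range(n)]


# Toy data at a 20:1 ratio: 100 negatives, 5 positives.
y_true = [1] * 5 + [0] * 100
always_negative = [0] * len(y_true)  # a classifier that ignores the minority class

print(summarize(y_true, always_negative))

bag = stratified_bag(y_true, random.Random(0))
print(sum(y_true[i] for i in bag), "positives in a bag of", len(bag))
```

A classifier that always predicts the majority class scores above 95% accuracy on this toy set but has zero sensitivity and a balanced accuracy of 0.5, which is why the study reports sensitivity and balanced accuracy. In a full stratified bagging ensemble, one base classifier (Random Forest in the paper) would be trained per bag and the predictions aggregated; that step is omitted here.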
