Comparing the performance of meta-classifiers: a case study on selected imbalanced data sets relevant for prediction of liver toxicity.

Author information

Department of Pharmaceutical Chemistry, University of Vienna, Althanstrasse 14, 1090, Vienna, Austria.

Computational Toxicology Group, CMS, R&D Platform Technology & Science, GSK, Park Road, Ware, Hertfordshire, SG12 0DP, UK.

Publication information

J Comput Aided Mol Des. 2018 May;32(5):583-590. doi: 10.1007/s10822-018-0116-z. Epub 2018 Apr 6.

Abstract

Cheminformatics datasets used in classification problems, especially those related to biological or physicochemical properties, are often imbalanced. This presents a major challenge in the development of in silico prediction models, as traditional machine learning algorithms are known to work best on balanced datasets. Class imbalance biases the performance of these algorithms because they favor the majority class. Here, we compare seven meta-classifiers for their ability to handle imbalanced datasets, using Random Forest as the base classifier. Four datasets that are directly (cholestasis) or indirectly (via inhibition of organic anion transporting polypeptides 1B1 and 1B3) related to liver toxicity were chosen for this purpose. The imbalance ratio (negative to positive class) in these datasets ranges from 4:1 to 20:1. Three different sets of molecular descriptors were used for model development, and model performance was assessed in 10-fold cross-validation and on an independent validation set. Stratified Bagging, MetaCost and CostSensitiveClassifier performed best among all the methods. While MetaCost and CostSensitiveClassifier provided better sensitivity values, Stratified Bagging resulted in high balanced accuracies.
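
To make the metrics in the abstract concrete, the following is a minimal Python sketch, not the authors' implementation. It uses a made-up toy label vector at the 20:1 end of the reported imbalance range to show why plain accuracy can look good for a classifier that always predicts the majority class, while sensitivity and balanced accuracy expose the problem. The function `stratified_bag` illustrates one common reading of stratified bagging (each bag draws, with replacement, as many majority-class samples as minority-class samples); the abstract does not specify the sampling scheme, so treat that as an assumption. All function names and the toy data are hypothetical.

```python
import random
from collections import Counter


def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive:
            if pred == positive:
                tp += 1
            else:
                fn += 1
        else:
            if pred == positive:
                fp += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def accuracy(y_true, y_pred):
    """Fraction of samples predicted correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def sensitivity(y_true, y_pred, positive=1):
    """True positive rate: TP / (TP + FN)."""
    tp, _, _, fn = confusion_counts(y_true, y_pred, positive)
    return tp / (tp + fn) if (tp + fn) else 0.0


def specificity(y_true, y_pred, positive=1):
    """True negative rate: TN / (TN + FP)."""
    _, fp, tn, _ = confusion_counts(y_true, y_pred, positive)
    return tn / (tn + fp) if (tn + fp) else 0.0


def balanced_accuracy(y_true, y_pred, positive=1):
    """Mean of sensitivity and specificity."""
    return (sensitivity(y_true, y_pred, positive)
            + specificity(y_true, y_pred, positive)) / 2


def stratified_bag(labels, rng, positive=1):
    """Assumed scheme: sample equal numbers of each class with replacement.

    Returns a list of indices into `labels` for one balanced bag.
    """
    pos = [i for i, y in enumerate(labels) if y == positive]
    neg = [i for i, y in enumerate(labels) if y != positive]
    n = min(len(pos), len(neg))
    return ([rng.choice(pos) for _ in range(n)]
            + [rng.choice(neg) for _ in range(n)])


# Toy labels (hypothetical): 200 negatives, 10 positives, i.e. 20:1.
y_true = [0] * 200 + [1] * 10
# A trivial classifier that always predicts the majority (negative) class.
y_majority = [0] * len(y_true)

acc = accuracy(y_true, y_majority)               # 200/210, about 0.952
sens = sensitivity(y_true, y_majority)           # 0.0: no positive is found
bal_acc = balanced_accuracy(y_true, y_majority)  # (0.0 + 1.0) / 2 = 0.5

# One balanced bag: 10 indices from each class, drawn with replacement.
bag = stratified_bag(y_true, random.Random(0))
bag_counts = Counter(y_true[i] for i in bag)

print(round(acc, 3), sens, bal_acc, dict(bag_counts))
```

In a full bagging ensemble, a base classifier (Random Forest in the study) would be trained on each such bag and the predictions aggregated; that training step is omitted here to keep the sketch dependency-free.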

