Department of Pharmaceutical Chemistry, University of Vienna, Althanstrasse 14, 1090, Vienna, Austria.
Computational Toxicology Group, CMS, R&D Platform Technology & Science, GSK, Park Road, Ware, Hertfordshire, SG12 0DP, UK.
J Comput Aided Mol Des. 2018 May;32(5):583-590. doi: 10.1007/s10822-018-0116-z. Epub 2018 Apr 6.
Cheminformatics datasets used in classification problems, especially those related to biological or physicochemical properties, are often imbalanced. This presents a major challenge in the development of in silico prediction models, as traditional machine learning algorithms are known to work best on balanced datasets. Class imbalance biases the performance of these algorithms because they favor the majority class. Here, we compare the performance of seven different meta-classifiers in handling imbalanced datasets, with Random Forest used as the base classifier. Four different datasets that are directly (cholestasis) or indirectly (via inhibition of organic anion transporting polypeptide 1B1 and 1B3) related to liver toxicity were chosen for this purpose. The imbalance ratio in these datasets ranges from 4:1 to 20:1 (negative to positive class). Three different sets of molecular descriptors were used for model development, and their performance was assessed in 10-fold cross-validation and on an independent validation set. Stratified bagging, MetaCost and CostSensitiveClassifier were found to be the best performing among all the methods. While MetaCost and CostSensitiveClassifier provided better sensitivity values, stratified bagging resulted in high balanced accuracies.
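To illustrate why sensitivity and balanced accuracy, rather than plain accuracy, are the informative measures at such imbalance ratios, the following minimal sketch scores a classifier that always predicts the majority (negative) class. The toy labels (200 negatives, 10 positives, i.e. a 20:1 ratio) and the helper names `confusion_counts` and `summarize` are assumptions made for illustration; they are not data or code from the study.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive and pred == positive:
            tp += 1
        elif truth == positive:
            fn += 1
        elif pred == positive:
            fp += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def summarize(y_true, y_pred):
    """Return accuracy, sensitivity, specificity and balanced accuracy."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return {
        "accuracy": (tp + tn) / len(y_true),
        "sensitivity": sensitivity,
        "specificity": specificity,
        # Balanced accuracy is the mean of sensitivity and specificity.
        "balanced_accuracy": (sensitivity + specificity) / 2,
    }


# Hypothetical toy labels with a 20:1 negative-to-positive ratio.
y_true = [0] * 200 + [1] * 10
# A degenerate model that always predicts the majority class.
majority_only = [0] * len(y_true)

result = summarize(y_true, majority_only)
for name, value in result.items():
    print(f"{name}: {value:.3f}")
```

In this toy case the plain accuracy is 200/210 (about 0.95) even though no positive compound is detected, whereas the balanced accuracy is 0.5, which is why the two measures can rank methods differently on imbalanced data.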