Laboratory of Physical Chemistry, ETH Zurich, Vladimir-Prelog-Weg 2, 8093 Zurich, Switzerland.
T5 Informatics GmbH, Spalenring 11, 4055 Basel, Switzerland.
J Chem Inf Model. 2021 Jun 28;61(6):2623-2640. doi: 10.1021/acs.jcim.1c00160. Epub 2021 Jun 8.
Machine learning classifiers trained on class-imbalanced data are prone to overpredicting the majority class. This leads to a larger misclassification rate for the minority class, which in many real-world applications is the class of interest. For binary classification, the decision threshold is set by default to 0.5, which is often not ideal for imbalanced data. Adjusting the decision threshold is an effective strategy for dealing with the class imbalance problem. In this work, we present two automated procedures for selecting the optimal decision threshold for imbalanced classification. A major advantage of our procedures is that they require neither retraining of the machine learning models nor resampling of the training data. The first approach is specific to random forest (RF), while the second approach, named GHOST, can potentially be applied to any machine learning classifier. We tested these procedures on 138 public drug discovery data sets containing structure-activity data for a variety of pharmaceutical targets. We show that both thresholding methods significantly improve the performance of RF. We tested GHOST with four different classifiers in combination with two molecular descriptors and found that most classifiers benefit from threshold optimization. GHOST also outperformed other strategies, including random undersampling and conformal prediction. Finally, we show that our thresholding procedures can be effectively applied to real-world drug discovery projects, where the imbalance and characteristics of the data vary greatly between the training and test sets.
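The general idea of shifting the decision threshold without retraining can be illustrated with a minimal sketch. This is not the GHOST procedure or the RF-specific procedure from the paper, whose details are not given in the abstract. The function names, the toy labels and probabilities, the candidate threshold grid, and the use of Cohen's kappa as the selection metric are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple


def cohen_kappa(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Cohen's kappa for binary labels (0 = majority, 1 = minority)."""
    n = len(y_true)
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    observed = (tp + tn) / n
    # Agreement expected by chance, from the marginal class frequencies.
    expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / (n * n)
    if expected == 1.0:  # only one class present in both vectors
        return 0.0
    return (observed - expected) / (1.0 - expected)


def select_threshold(
    y_true: Sequence[int],
    probs: Sequence[float],
    thresholds: Sequence[float],
) -> Tuple[float, float]:
    """Return the (threshold, kappa) pair with the highest kappa.

    Hypothetical helper: it scans a fixed grid and scores each candidate
    on predicted probabilities that are already available, so the model
    itself is never retrained.
    """
    best_t, best_k = thresholds[0], float("-inf")
    for t in thresholds:
        preds = [1 if p >= t else 0 for p in probs]
        k = cohen_kappa(y_true, preds)
        if k > best_k:  # strict comparison keeps the first of any ties
            best_t, best_k = t, k
    return best_t, best_k


# Toy imbalanced data (assumed): 8 inactives, 2 actives. The classifier
# ranks the actives highest but never outputs a probability above 0.5.
y_true: List[int] = [0] * 8 + [1] * 2
probs: List[float] = [0.05, 0.10, 0.10, 0.15, 0.20, 0.20, 0.25, 0.30, 0.35, 0.45]
grid: List[float] = [i / 20 for i in range(1, 11)]  # 0.05, 0.10, ..., 0.50

default_preds = [1 if p >= 0.5 else 0 for p in probs]
default_kappa = cohen_kappa(y_true, default_preds)
best_t, best_k = select_threshold(y_true, probs, grid)
print(f"kappa at 0.5: {default_kappa:.2f}; best threshold: {best_t:.2f} (kappa {best_k:.2f})")
```

On this toy data the default threshold of 0.5 assigns every sample to the majority class, while a lower threshold recovers both minority samples. In practice the threshold would be chosen on training or held-out predictions and then applied unchanged to the test set.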