Department of Medicinal Chemistry, Institute of Pharmacology, Polish Academy of Sciences, Smętna 12, 31-343 Kraków, Poland.
Department of Medicinal Chemistry, Institute of Pharmacology, Polish Academy of Sciences, Smętna 12, 31-343 Kraków, Poland; Faculty of Chemistry, Jagiellonian University, R. Ingardena 3, 30-060 Kraków, Poland.
J Cheminform. 2014 Jun 11;6:32. doi: 10.1186/1758-2946-6-32. eCollection 2014.
The paper presents a thorough analysis of the influence of the number of negative training examples on the performance of machine learning methods.
The impact of this rather neglected aspect of applying machine learning methods was examined for training sets containing a fixed number of positive examples and a varying number of negative examples randomly selected from the ZINC database. Changing the ratio of positive to negative training instances was found to greatly influence most of the investigated evaluation parameters of ML methods in simulated virtual screening experiments. In the majority of cases, substantial increases in precision and MCC were observed, accompanied by some decrease in hit recall. Analysis of the dynamics of these variations allowed us to recommend an optimal composition of training data. The study was performed on several protein targets, using 5 machine learning algorithms (SMO, Naïve Bayes, IBk, J48 and Random Forest) and 2 types of molecular fingerprints (MACCS and CDK FP). The most effective classification was provided by the combination of CDK FP with the SMO or Random Forest algorithm. The Naïve Bayes models appeared to be largely insensitive to changes in the number of negative instances in the training set.
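For readers less familiar with the evaluation parameters named above, the following minimal sketch shows how precision, hit recall and MCC are conventionally computed from a confusion matrix. The function name `screening_metrics` and the example counts are hypothetical illustrations, not values or code from the study; returning 0.0 when a denominator is zero is likewise an assumption of this sketch.

```python
import math


def screening_metrics(tp, fp, tn, fn):
    """Return (precision, recall, mcc) for one confusion matrix.

    tp/fn count active compounds predicted active/inactive;
    fp/tn count inactive compounds predicted active/inactive.
    Undefined ratios (zero denominator) are reported as 0.0,
    which is a convention chosen for this sketch only.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return precision, recall, mcc


if __name__ == "__main__":
    # Hypothetical counts, chosen only to exercise the function.
    p, r, m = screening_metrics(tp=40, fp=10, tn=90, fn=10)
    print(f"precision={p:.3f} recall={r:.3f} MCC={m:.3f}")
```

Because MCC uses all four cells of the confusion matrix, it is a convenient single summary when precision and recall move in opposite directions, as reported here.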
In conclusion, the ratio of positive to negative training instances should be taken into account when preparing machine learning experiments, as it may significantly influence the performance of a particular classifier. Moreover, optimization of the negative training set size can be applied as a boosting-like approach in machine learning-based virtual screening.
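The experimental design described above (a fixed set of positives combined with randomly drawn negative sets of varying size) can be sketched as follows. This is only an illustration under stated assumptions: the function `build_training_sets`, its parameters, the ratio values and the compound identifiers are all hypothetical, and the abstract does not specify how the authors implemented the sampling.

```python
import random


def build_training_sets(positives, negative_pool, neg_per_pos, seed=0):
    """Build one labelled training set per negative-to-positive ratio.

    positives     -- list of active compounds (kept fixed in every set)
    negative_pool -- list of candidate inactives to sample from
    neg_per_pos   -- iterable of ratios (negatives per positive); hypothetical
    seed          -- seed for reproducible random sampling
    """
    rng = random.Random(seed)
    training_sets = {}
    for ratio in neg_per_pos:
        n_neg = int(round(ratio * len(positives)))
        if n_neg > len(negative_pool):
            raise ValueError("negative pool too small for ratio %r" % ratio)
        # Sample without replacement so no negative appears twice in a set.
        negatives = rng.sample(negative_pool, n_neg)
        training_sets[ratio] = (
            [(mol, 1) for mol in positives] + [(mol, 0) for mol in negatives]
        )
    return training_sets


if __name__ == "__main__":
    # Placeholder identifiers standing in for actives and database compounds.
    actives = ["active_%d" % i for i in range(10)]
    pool = ["decoy_%d" % i for i in range(100)]
    for ratio, data in build_training_sets(actives, pool, [0.5, 1, 2, 5]).items():
        print(ratio, len(data))
```

Each resulting set could then be used to train a classifier and be scored on a common test set, which is one way to search for a suitable negative set size.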