School of Informatics and Computing, Indiana University at Bloomington, Bloomington, Indiana 47405, USA.
J Chem Inf Model. 2012 Mar 26;52(3):792-803. doi: 10.1021/ci200615h. Epub 2012 Mar 8.
Random forest is currently considered one of the most accurate QSAR methods available, but it is computationally intensive. Naïve Bayes is a simple, robust classification method, and the Laplacian-modified Naïve Bayes implementation is the preferred QSAR method in the widely used commercial chemoinformatics platform Pipeline Pilot. We compared the ability of Pipeline Pilot Naïve Bayes (PLPNB) and random forest to make accurate predictions on 18 large, diverse in-house QSAR data sets covering on-target and ADME-related activities. The data sets were set up as classification problems with either binary or multicategory activities. We divided training and test sets with a time-split method, which we consider a realistic way of simulating prospective prediction. PLPNB is computationally efficient. However, on our data sets, random forest predictions are at least as good as those of PLPNB and in many cases significantly better. PLPNB performs better with ECFP4 and ECFP6 descriptors, which are native to Pipeline Pilot, than with the other descriptors we tried.
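The time-split idea can be illustrated with a minimal Python sketch. It is not the authors' procedure: the record fields (`id`, `date`, `active`), the function name `time_split`, and the 75% training fraction are assumptions chosen for illustration, since the abstract does not state how the split point was set. The only property it aims to show is that every training compound predates every test compound, which mimics predicting future compounds from past ones.

```python
from datetime import date


def time_split(records, train_fraction=0.75):
    """Split records chronologically: earliest go to training, latest to test.

    `records` is a list of dicts with a "date" key (hypothetical schema).
    `train_fraction` is an illustrative choice, not a value from the paper.
    """
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must be strictly between 0 and 1")
    # Sort by registration/assay date so the split respects time order.
    ordered = sorted(records, key=lambda r: r["date"])
    cut = int(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]


# Toy data, invented purely for the example.
records = [
    {"id": "cmpd-3", "date": date(2010, 5, 1), "active": 1},
    {"id": "cmpd-1", "date": date(2008, 1, 15), "active": 0},
    {"id": "cmpd-4", "date": date(2011, 9, 30), "active": 0},
    {"id": "cmpd-2", "date": date(2009, 3, 2), "active": 1},
]

train, test = time_split(records)
train_ids = [r["id"] for r in train]
test_ids = [r["id"] for r in test]
```

Unlike a random split, this one never lets the model see compounds made after those it is asked to predict, which is why it is often regarded as a more demanding test.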