Junshui Ma, Robert P. Sheridan, Andy Liaw, George E. Dahl, Vladimir Svetnik
Biometrics Research Department and Structural Chemistry Department, Merck Research Laboratories, Rahway, New Jersey 07065, United States.
J Chem Inf Model. 2015 Feb 23;55(2):263-74. doi: 10.1021/ci500747n. Epub 2015 Feb 17.
Neural networks were widely used for quantitative structure-activity relationships (QSAR) in the 1990s. Because of various practical issues (e.g., slow training on large problems, difficulty of training, proneness to overfitting), they were superseded by more robust methods such as support vector machines (SVM) and random forests (RF), which arose in the early 2000s. The last 10 years have witnessed a revival of neural networks in the machine learning community thanks to new methods for preventing overfitting, more efficient training algorithms, and advances in computer hardware. In particular, deep neural nets (DNNs), i.e., neural nets with more than one hidden layer, have found great success in many applications, such as computer vision and natural language processing. Here we show that DNNs can routinely make better prospective predictions than RF on a set of large, diverse QSAR data sets taken from Merck's drug discovery effort. The number of adjustable parameters needed for DNNs is fairly large, but our results show that it is not necessary to optimize them for individual data sets: a single set of recommended parameters can achieve better performance than RF for most of the data sets we studied. The usefulness of these parameters is demonstrated on additional data sets not used in the calibration. Although training DNNs is still computationally intensive, using graphics processing units (GPUs) makes this issue manageable.
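The comparison the abstract describes can be illustrated with a minimal sketch: a neural net with more than one hidden layer versus a random forest, fit to synthetic "descriptor" data. This is not the authors' implementation (they used custom GPU-trained DNNs on Merck data); the scikit-learn estimators, data shapes, and hyperparameters below are assumptions chosen for illustration only.

```python
# Toy DNN-vs-RF regression comparison on synthetic descriptor data.
# All names and sizes are illustrative; this is not the paper's setup.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 50))                  # 500 "compounds" x 50 descriptors
y = 2 * X[:, 0] - X[:, 1] + rng.normal(scale=0.1, size=500)  # synthetic activity

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# "Deep" in the abstract's sense: more than one hidden layer.
dnn = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=2000,
                   random_state=0).fit(X_tr, y_tr)
rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_tr, y_tr)

print("DNN R^2:", round(dnn.score(X_te, y_te), 3))
print("RF  R^2:", round(rf.score(X_te, y_te), 3))
```

On real QSAR data the paper's point is that a single recommended DNN configuration, not per-data-set tuning, suffices to beat RF on most sets.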