Computational and Structural Chemistry, Merck & Company, Inc., West Point, Pennsylvania 19486, United States.
Computational and Structural Chemistry, Merck & Company, Inc., Kenilworth, New Jersey 07033, United States.
J Chem Inf Model. 2020 Oct 26;60(10):4653-4663. doi: 10.1021/acs.jcim.0c00678. Epub 2020 Oct 6.
While Gaussian process models are typically restricted to smaller data sets, we propose a variation that extends their applicability to the larger data sets common in the industrial drug discovery space, making it relatively novel in the quantitative structure-activity relationship (QSAR) field. By incorporating locality-sensitive hashing for fast nearest neighbor searches, the nearest neighbor Gaussian process model makes predictions with time complexity that is sublinear in the sample size. The model can be built efficiently, permitting rapid updates to prevent degradation as new data are collected. Given its small number of hyperparameters, it is robust against overfitting and generalizes about as well as other common QSAR models. Like the usual Gaussian process model, it natively produces principled and well-calibrated uncertainty estimates on its predictions. We compare this new model with implementations of random forest, light gradient boosting, and k-nearest neighbors to highlight these promising advantages. The code for the nearest neighbor Gaussian process is available at https://github.com/Merck/nngp.
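The core idea of predicting from a Gaussian process fitted only to a query's nearest neighbors can be illustrated with a minimal sketch. This is not the code from the repository above: the RBF kernel, the zero-mean prior, the function names (`rbf_kernel`, `nngp_predict`), and all hyperparameter defaults are assumptions chosen for illustration. The neighbor search here is brute force, standing in for locality-sensitive hashing, so the sketch does not reproduce the sublinear prediction time described in the abstract.

```python
import numpy as np


def rbf_kernel(a, b, length_scale=1.0, signal_var=1.0):
    """Squared-exponential kernel between the rows of a and the rows of b."""
    sq = (
        np.sum(a**2, axis=1)[:, None]
        + np.sum(b**2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    return signal_var * np.exp(-0.5 * np.maximum(sq, 0.0) / length_scale**2)


def nngp_predict(x_train, y_train, x_query, k=10,
                 length_scale=1.0, signal_var=1.0, noise_var=0.1):
    """Predict mean and latent variance per query from a local GP on k neighbors.

    Hypothetical helper: brute-force neighbor search replaces LSH here.
    """
    k = min(k, len(x_train))
    means = np.empty(len(x_query))
    variances = np.empty(len(x_query))
    for i, x in enumerate(x_query):
        # Brute-force k nearest neighbors (assumption: stand-in for LSH).
        dist = np.sum((x_train - x) ** 2, axis=1)
        idx = np.argsort(dist)[:k]
        xn, yn = x_train[idx], y_train[idx]

        # Local covariance matrix with observation noise on the diagonal.
        K = rbf_kernel(xn, xn, length_scale, signal_var) + noise_var * np.eye(k)
        k_star = rbf_kernel(xn, x[None, :], length_scale, signal_var)[:, 0]

        # Standard zero-mean GP posterior via a Cholesky factorization.
        L = np.linalg.cholesky(K)
        alpha = np.linalg.solve(L.T, np.linalg.solve(L, yn))
        v = np.linalg.solve(L, k_star)

        means[i] = k_star @ alpha
        # Clip tiny negative values caused by floating-point round-off.
        variances[i] = max(signal_var - v @ v, 0.0)
    return means, variances
```

Because each prediction factorizes only a k-by-k matrix, its cost is independent of the training set size once the neighbors are known; the neighbor search is then the step that a hashing scheme would need to accelerate. The returned variance grows toward `signal_var` for queries far from all training points, which is the behavior that makes the uncertainty estimate informative.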