Parkinson Jonathan, Wang Wei
J Chem Inf Model. 2023 Aug 14;63(15):4589-4601. doi: 10.1021/acs.jcim.3c00601. Epub 2023 Jul 27.
Gaussian processes (GPs) are Bayesian models that offer several advantages for regression tasks in machine learning, such as reliable quantitation of uncertainty and improved interpretability. Their adoption has been limited by their excessive computational cost and by the difficulty of adapting them to the analysis of sequences (e.g., amino acid sequences) and graphs (e.g., small molecules). In this study, we introduce a group of random feature-approximated kernels for sequences and graphs that exhibit linear scaling with both the size of the training set and the size of the sequences or graphs. We incorporate these kernels into our new Python library for GP regression, xGPR, and develop an efficient and scalable algorithm for fitting GPs equipped with these kernels to large datasets. We compare the performance of xGPR on 17 different benchmarks with both standard and state-of-the-art deep learning models and find that GP regression achieves highly competitive accuracy on these tasks while providing well-calibrated uncertainty quantitation and improved interpretability. Finally, in a simple experiment, we illustrate how xGPR may be used as part of an active learning strategy to engineer a protein with a desired property in an automated way, without human intervention.
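To make the linear-scaling idea concrete, the following is a minimal sketch of random-feature-approximated GP regression. It is not the xGPR API and it does not implement the sequence or graph kernels introduced in the paper. It assumes the generic random Fourier feature approximation of an RBF kernel on fixed-length vectors, and all function names (make_rff, featurize, fit, predict) and parameter values are hypothetical choices made for illustration. With m random features, fitting reduces to Bayesian linear regression whose cost grows linearly with the number of training points n.

```python
import numpy as np


def make_rff(input_dim, num_features, lengthscale, rng):
    """Draw the random projection used to approximate an RBF kernel."""
    weights = rng.normal(scale=1.0 / lengthscale, size=(input_dim, num_features))
    phases = rng.uniform(0.0, 2.0 * np.pi, size=num_features)
    return weights, phases


def featurize(x, weights, phases):
    """Map inputs of shape (n, d) to random features of shape (n, m)."""
    num_features = weights.shape[1]
    return np.sqrt(2.0 / num_features) * np.cos(x @ weights + phases)


def fit(features, y, noise_var):
    """Bayesian linear regression with a unit Gaussian prior on the weights.

    The cost is O(n m^2 + m^3), i.e. linear in the training set size n.
    """
    m = features.shape[1]
    precision = features.T @ features / noise_var + np.eye(m)
    chol = np.linalg.cholesky(precision)  # precision = chol @ chol.T
    rhs = features.T @ y / noise_var
    mean = np.linalg.solve(chol.T, np.linalg.solve(chol, rhs))
    return mean, chol


def predict(features, mean, chol):
    """Return the posterior mean and variance of the latent function."""
    pred_mean = features @ mean
    # Variance is diag(Phi A^{-1} Phi^T), computed via the Cholesky factor of A.
    half = np.linalg.solve(chol, features.T)
    pred_var = np.sum(half ** 2, axis=0)
    return pred_mean, pred_var


# Toy data (an assumption for illustration): a noisy sine curve on [-3, 3].
rng = np.random.default_rng(0)
x_train = rng.uniform(-3.0, 3.0, size=(2000, 1))
y_train = np.sin(x_train[:, 0]) + 0.1 * rng.normal(size=2000)

weights, phases = make_rff(1, 256, lengthscale=1.0, rng=rng)
mean, chol = fit(featurize(x_train, weights, phases), y_train, noise_var=0.01)

# One test point inside the training range and one far outside it.
x_test = np.array([[0.5], [10.0]])
pred_mean, pred_var = predict(featurize(x_test, weights, phases), mean, chol)
print(pred_mean, pred_var)
```

The predictive variance returned by predict is the kind of quantity an active learning loop could use to decide which candidate to evaluate next, although the selection strategy used in the paper is not specified in this abstract.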