Kim QHwan, Ko Joon-Hyuk, Kim Sunghoon, Park Nojun, Jhe Wonho
Department of Physics and Astronomy, Institute of Applied Physics, Seoul National University, Gwanak-gu, Seoul 08826, Republic of Korea.
Bioinformatics. 2021 Oct 25;37(20):3428-3435. doi: 10.1093/bioinformatics/btab346.
Characterizing drug-protein interactions (DPIs) is crucial to high-throughput screening for drug discovery. Deep learning-based approaches have attracted attention because they can predict DPIs without human trial and error. However, because data labeling is resource-intensive, the available labeled protein datasets are relatively small, which limits model performance. Here, we propose two methods for constructing a deep learning framework that achieves superior performance with a small labeled dataset.
First, we use transfer learning to encode protein sequences with a pretrained model that learns general sequence representations in an unsupervised manner. Second, we use a Bayesian neural network to build a robust model by estimating the data uncertainty. The resulting model outperforms previous baselines at predicting interactions between molecules and proteins. We also show that the uncertainty quantified by Bayesian inference is related to prediction confidence and can be used to screen DPI data points.
The code is available at https://github.com/QHwan/PretrainDPI.
Supplementary data are available at Bioinformatics online.
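The uncertainty-based screening described in the abstract can be illustrated with a minimal sketch. It is not taken from the authors' repository. It assumes a binary interaction classifier that has been sampled T times (for example, by stochastic forward passes of a Bayesian network) to give an array of predicted probabilities, and it uses one common decomposition of predictive variance into a data (aleatoric) term and a model (epistemic) term. The function names, the array layout, and the `max_uncertainty` threshold are all hypothetical choices made for illustration.

```python
import numpy as np


def summarize_mc_predictions(probs):
    """Summarize sampled predictions for N drug-protein pairs.

    probs: array-like of shape (T, N) holding interaction probabilities
           from T stochastic forward passes (assumed input layout).
    Returns the predictive mean, the aleatoric (data) term and the
    epistemic (model) term, each of shape (N,).
    """
    probs = np.asarray(probs, dtype=float)
    mean = probs.mean(axis=0)
    # Data uncertainty: average Bernoulli variance of each sampled prediction.
    aleatoric = (probs * (1.0 - probs)).mean(axis=0)
    # Model uncertainty: variance of the sampled probabilities across passes.
    epistemic = (probs ** 2).mean(axis=0) - mean ** 2
    return mean, aleatoric, epistemic


def screen_confident(probs, max_uncertainty=0.1):
    """Flag pairs whose total uncertainty is at most max_uncertainty.

    The threshold is an illustrative assumption, not a value from the paper.
    Returns a boolean mask, the predictive mean and the total uncertainty.
    """
    mean, aleatoric, epistemic = summarize_mc_predictions(probs)
    total = aleatoric + epistemic
    keep = total <= max_uncertainty
    return keep, mean, total
```

Under this decomposition the two terms sum to `mean * (1 - mean)`, so the total is largest for pairs whose averaged prediction sits near 0.5. Keeping the terms separate still shows whether a pair is uncertain because the samples disagree with one another or because each individual prediction is itself ambiguous.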