Department of Electrical and Computer Engineering, College Station, TX, USA.
TEES-AgriLife Center for Bioinformatics and Genomic Systems Engineering, College Station, TX, USA.
Bioinformatics. 2019 Sep 15;35(18):3329-3338. doi: 10.1093/bioinformatics/btz111.
Drug discovery demands rapid quantification of compound-protein interaction (CPI). However, there is a lack of methods that can predict compound-protein affinity from sequences alone with high applicability, accuracy and interpretability.
We present a seamless integration of domain knowledge and learning-based approaches. Using novel representations of structurally annotated protein sequences, we propose a semi-supervised deep learning model that unifies recurrent and convolutional neural networks to exploit both unlabeled and labeled data, jointly encoding molecular representations and predicting affinities. Our representations and models outperform conventional options, achieving relative error in IC50 within 5-fold for test cases and within 20-fold for protein classes not included in training. Performance for new protein classes with few labeled data is further improved by transfer learning. Furthermore, separate and joint attention mechanisms are developed and embedded into our model to add to its interpretability, as illustrated in case studies for predicting and explaining selective drug-target interactions. Lastly, alternative representations using protein sequences or compound graphs, and a unified RNN/GCNN-CNN model using graph CNN (GCNN), are also explored to reveal the algorithmic challenges ahead.
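To make the "within 5-fold" and "within 20-fold" figures concrete, the sketch below converts a multiplicative (fold) error in IC50 into an absolute error on a log10 scale and back. It assumes, as is common for affinity data but not stated in the abstract, that errors are measured on log10-transformed IC50 values. The function names are illustrative and are not taken from the DeepAffinity code.

```python
import math


def fold_error_to_log10(fold: float) -> float:
    """Absolute log10 error corresponding to a given fold error in IC50."""
    if fold <= 0:
        raise ValueError("fold must be positive")
    return abs(math.log10(fold))


def log10_error_to_fold(log_err: float) -> float:
    """Fold error in IC50 corresponding to an absolute log10 error."""
    return 10 ** abs(log_err)


def within_fold(ic50_true: float, ic50_pred: float, fold: float) -> bool:
    """True if the prediction is within `fold`-fold of the true IC50.

    Assumes both IC50 values are positive, share the same units,
    and that fold >= 1.
    """
    return abs(math.log10(ic50_pred / ic50_true)) <= math.log10(fold)


if __name__ == "__main__":
    # Log10 error equivalents of the two fold errors named in the abstract.
    print(round(fold_error_to_log10(5), 3))
    print(round(fold_error_to_log10(20), 3))
    # Hypothetical example values (nM), not data from the paper.
    print(within_fold(100.0, 450.0, 5))
```

Under this assumption, a 5-fold error corresponds to roughly 0.70 log10 units and a 20-fold error to roughly 1.30 log10 units.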
Data and source code are available at https://github.com/Shen-Lab/DeepAffinity.
Supplementary data are available at Bioinformatics online.