Department of Interdisciplinary Informatics, Kyushu Institute of Technology, Japan.
Tulane University, USA.
Brief Bioinform. 2021 Nov 5;22(6). doi: 10.1093/bib/bbab228.
Viral infection involves a large number of protein-protein interactions (PPIs) between human and viral proteins. These PPIs range from the initial binding of viral coat proteins to host membrane receptors to the hijacking of the host transcription machinery. However, few interspecies PPIs have been identified, because experimental methods including mass spectrometry are time-consuming and expensive, and molecular dynamics simulation is limited to proteins whose 3D structures have been solved. Sequence-based machine learning methods are expected to overcome these problems. We have developed, for the first time, an LSTM model with word2vec, named LSTM-PHV, that predicts PPIs between human and virus from amino acid sequences alone. LSTM-PHV effectively learned from training data with a highly imbalanced ratio of positive to negative samples and achieved AUCs of 0.976 and 0.973 and accuracies of 0.984 and 0.985 on the training and independent datasets, respectively. In predicting PPIs between human and unknown or new viruses, the trained LSTM-PHV greatly outperformed existing state-of-the-art PPI predictors. Interestingly, learning only the sequence contexts as words is sufficient for PPI prediction. Uniform manifold approximation and projection (UMAP) showed that LSTM-PHV clearly separated the positive PPI samples from the negative ones. The LSTM-PHV web server and supporting data are freely available at http://kurata35.bio.kyutech.ac.jp/LSTM-PHV.
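The idea of treating sequence contexts as words can be illustrated with a minimal sketch. It splits an amino acid sequence into overlapping k-mers and maps them to integer indices, which is the kind of input a word2vec embedding followed by an LSTM would consume. This is not the authors' implementation: the function names, the choice of k = 4, and the use of index 0 for unseen words are assumptions made only for illustration, and the embedding and LSTM steps themselves are not shown.

```python
from collections import Counter


def kmer_words(sequence: str, k: int = 4) -> list[str]:
    """Split an amino acid sequence into overlapping k-mer 'words'.

    k = 4 is an illustrative assumption, not a value taken from the abstract.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    seq = sequence.strip().upper()
    # A sequence shorter than k yields no words.
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_vocab(sequences: list[str], k: int = 4) -> dict[str, int]:
    """Assign an integer index to every k-mer seen in the given sequences.

    Index 0 is reserved (hypothetically) for padding and unseen k-mers,
    so real words start at 1, most frequent first.
    """
    counts = Counter(w for s in sequences for w in kmer_words(s, k))
    return {w: i + 1 for i, (w, _) in enumerate(counts.most_common())}


def encode(sequence: str, vocab: dict[str, int], k: int = 4) -> list[int]:
    """Turn a sequence into a list of word indices; unseen k-mers map to 0."""
    return [vocab.get(w, 0) for w in kmer_words(sequence, k)]


if __name__ == "__main__":
    toy_sequences = ["MKTAYMKTA"]  # toy sequence, not real data
    vocab = build_vocab(toy_sequences)
    print(kmer_words("MKTAY"))
    print(encode("MKTAYMKTA", vocab))
```

In such a setup, the human and viral sequences of a candidate pair would each be encoded this way before being passed to the embedding and recurrent layers.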