College of Computer Science and Technology, China University of Petroleum (East China), Qingdao, China.
Department of Artificial Intelligence, Polytechnical University of Madrid, Madrid, Spain.
BMC Genomics. 2022 Aug 4;23(1):555. doi: 10.1186/s12864-022-08772-6.
Protein-protein interactions (PPIs) are essential to many biochemical processes, so accurate PPI prediction can help us better understand the roles that proteins play in them. Although many biological methods exist for identifying PPIs, they are time-consuming and lack accuracy, so an efficient and accurate computational model for PPI prediction is needed.
We present a novel sequence-based computational approach called DCSE (Double-Channel-Siamese-Ensemble) to predict potential PPIs. In the encoding layer, we treat each amino acid as a word and map it to an N-dimensional vector. In the feature extraction layer, we extract features from local and global perspectives with a Multilayer Convolutional Neural Network (MCN) and a Multilayer Bidirectional Gated Recurrent Unit with Convolutional Neural Networks (MBC). Finally, the output of the feature extraction layer is fed into the prediction layer, which outputs whether the input protein pair will interact with each other. MCN and MBC are both built on siamese and ensemble network structures, which improve the performance of the model. To demonstrate our model's performance, we compare it with four machine learning based and three deep learning based models. The results show that our method outperforms the other models on Accuracy, F1, Recall and MCC. The Accuracy, Precision, F1, Recall and MCC of our model are 0.9303, 0.9091, 0.9268, 0.9452 and 0.8609, respectively. For the other seven models, the highest Accuracy, Precision, F1, Recall and MCC are 0.9288, 0.9243, 0.9246, 0.9250 and 0.8572, respectively. We also test our model on an imbalanced dataset and transfer it to another species. The results indicate that the model also performs well in both settings.
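The five evaluation criteria reported above can be computed from a binary confusion matrix. The following is a minimal sketch using the standard textbook definitions, not the authors' evaluation code; the function names and the example counts are hypothetical. The last helper recomputes F1 from the reported Precision and Recall as a consistency check.

```python
import math


def classification_metrics(tp, fp, tn, fn):
    """Standard binary metrics from confusion-matrix counts.

    tp/fp/tn/fn are hypothetical counts of true/false positives/negatives;
    a zero denominator yields 0.0 instead of raising an error.
    """
    def safe_div(num, den):
        return num / den if den else 0.0

    accuracy = safe_div(tp + tn, tp + fp + tn + fn)
    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    f1 = safe_div(2 * precision * recall, precision + recall)
    # Matthews correlation coefficient (MCC).
    mcc = safe_div(
        tp * tn - fp * fn,
        math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)),
    )
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "mcc": mcc,
    }


def f1_from_precision_recall(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Illustrative counts only (not taken from the paper).
example = classification_metrics(tp=90, fp=10, tn=85, fn=15)

# Consistency check against the reported Precision and Recall of DCSE.
reported_f1 = f1_from_precision_recall(0.9091, 0.9452)
```

Rounded to four decimals, `reported_f1` agrees with the F1 value of 0.9268 stated in the abstract.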
Compared with seven other models, our model achieves the best overall performance. The NLP-based encoding method works well for the PPI prediction task. MCN and MBC extract protein sequence features from local and global perspectives, and both feature extraction layers are built on siamese and ensemble network structures. The siamese structure keeps the features consistent, and the ensemble structure improves the accuracy of the model.
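The ideas summarized above (amino acids as words, a shared siamese encoder, and an ensemble of two channels) can be illustrated with a toy sketch. This is not the DCSE implementation: the two channel functions are simple pooling stand-ins rather than the convolutional and recurrent networks, it assumes MCN supplies the local view and MBC the global one, the embedding table is random rather than learned, and every name and parameter here is hypothetical.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
TOKEN_ID = {aa: i + 1 for i, aa in enumerate(AMINO_ACIDS)}  # 0 = padding/unknown


def tokenize(seq, max_len):
    """Treat each amino acid as a word: map residues to integer ids, then pad."""
    ids = [TOKEN_ID.get(aa, 0) for aa in seq.upper()[:max_len]]
    return ids + [0] * (max_len - len(ids))


def make_embedding(dim, seed=0):
    """Random stand-in for a learned table of N-dimensional residue vectors."""
    rng = random.Random(seed)
    table = [[0.0] * dim]  # row 0 is the padding vector
    table += [[rng.uniform(-1, 1) for _ in range(dim)] for _ in AMINO_ACIDS]
    return table


def local_channel(vectors, window=3):
    """Local view (stand-in for MCN): max over sliding-window means."""
    dim = len(vectors[0])
    pooled = [-math.inf] * dim
    for start in range(len(vectors) - window + 1):
        for d in range(dim):
            mean = sum(vectors[start + k][d] for k in range(window)) / window
            pooled[d] = max(pooled[d], mean)
    return pooled


def global_channel(vectors):
    """Global view (stand-in for MBC): mean over the whole padded sequence."""
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


def channel_score(encoder, seq_a, seq_b, table, max_len):
    """Siamese step: the same encoder and table process both proteins."""
    feat_a = encoder([table[i] for i in tokenize(seq_a, max_len)])
    feat_b = encoder([table[i] for i in tokenize(seq_b, max_len)])
    dot = sum(x * y for x, y in zip(feat_a, feat_b))
    return 1.0 / (1.0 + math.exp(-dot))  # squash to a score in (0, 1)


def predict(seq_a, seq_b, table, max_len=16):
    """Ensemble step: average the scores of the two channels."""
    channels = (local_channel, global_channel)
    scores = [channel_score(c, seq_a, seq_b, table, max_len) for c in channels]
    return sum(scores) / len(scores)


table = make_embedding(dim=8)
score = predict("MKVLAAGIT", "GSHMTEYKL", table)  # made-up sequences
```

Because both proteins pass through the same encoder with the same weights, swapping the inputs leaves this toy score unchanged, which is one concrete sense in which a siamese structure keeps the two feature vectors consistent.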