The Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Sciences, Urumqi 830011, China.
University of Chinese Academy of Sciences, Beijing 100049, China.
Genes (Basel). 2019 Nov 12;10(11):924. doi: 10.3390/genes10110924.
Self-interacting proteins (SIPs) are of paramount importance in current molecular biology. A number of traditional biological experimental methods for identifying SIPs have been developed in the past few years. However, these methods are costly, time-consuming, and inefficient, which often limits their use. Computational methods have therefore emerged to meet this need. In this paper, we propose, for the first time, a novel deep learning model that incorporates natural language processing (NLP) methods to predict potential SIPs from protein sequence information. More specifically, the protein sequence is first de novo assembled by [...]. Then, we obtain a global vectors representation for each protein sequence using an NLP technique. Finally, based on known self-interacting and non-interacting proteins, a multi-grained cascade forest model is trained to predict SIPs. Comprehensive experiments were performed on the [...] and [...] datasets, yielding accuracies of 91.45% and 93.12%, respectively. The experimental results show that amino acid semantic information is very helpful for addressing the problem of sequences containing both self-interacting and non-interacting pairs of proteins. This work has potential applications in various biological classification problems.
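The sequence-to-vector step can be illustrated with a minimal sketch. It is not the authors' implementation. The abstract does not say how sequences are tokenized, so single amino acid residues are assumed as tokens here. Row-normalised co-occurrence profiles stand in for trained global vectors, which would instead be fitted to these counts. The distance-weighted, symmetric co-occurrence counting follows the usual global vectors convention. All function names and the toy sequences are hypothetical, and the cascade forest classifier is not shown.

```python
from collections import Counter

# The 20 standard amino acids, used as the token vocabulary (an assumption).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def cooccurrence_counts(sequences, window=2):
    """Count symmetric residue co-occurrences within `window` positions.

    Pairs that are d positions apart contribute 1/d to the count.
    """
    counts = Counter()
    for seq in sequences:
        for i, a in enumerate(seq):
            for d in range(1, window + 1):
                if i + d < len(seq):
                    b = seq[i + d]
                    counts[(a, b)] += 1.0 / d
                    counts[(b, a)] += 1.0 / d
    return counts


def residue_vectors(counts):
    """Build one vector per residue from its row-normalised co-occurrence row.

    This is a simple stand-in for trained embeddings, not a trained model.
    """
    vectors = {}
    for a in AMINO_ACIDS:
        row = [counts.get((a, b), 0.0) for b in AMINO_ACIDS]
        total = sum(row)
        vectors[a] = [x / total for x in row] if total else row
    return vectors


def sequence_vector(seq, vectors):
    """Average the residue vectors of a sequence into one fixed-length vector."""
    known = [vectors[r] for r in seq if r in vectors]
    dim = len(AMINO_ACIDS)
    if not known:
        return [0.0] * dim
    return [sum(v[j] for v in known) / len(known) for j in range(dim)]


# Toy sequences for illustration only; they are not taken from any dataset.
toy_sequences = ["MKTAYIAKQR", "GAVLIMKTAY"]
counts = cooccurrence_counts(toy_sequences)
vectors = residue_vectors(counts)
features = [sequence_vector(s, vectors) for s in toy_sequences]
```

Each entry of `features` is a fixed-length vector that could be passed, together with a self-interacting or non-interacting label, to any supervised classifier.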