IEEE/ACM Trans Comput Biol Bioinform. 2021 Nov-Dec;18(6):2208-2217. doi: 10.1109/TCBB.2020.2968882. Epub 2021 Dec 8.
Knowledge of protein functions plays an important role in biology and medicine. With the rapid development of high-throughput technologies, a huge number of proteins have been discovered, yet many of them still lack functional annotations. A protein usually has multiple functions, and some functions or biological processes require interactions among multiple proteins. Gene Ontology provides a useful classification of protein functions and contains more than 40,000 terms. We propose a deep learning framework called DeepGOA that predicts protein functions from protein sequences and protein-protein interaction (PPI) networks. From protein sequences, we extract two types of information: sequence semantic information and subsequence-based features. We use the word2vec technique to represent protein sequences numerically, and use a bidirectional long short-term memory network (Bi-LSTM) and a multi-scale convolutional neural network (multi-scale CNN) to obtain the global and local semantic features of protein sequences, respectively. We also scan protein sequences with the InterPro tool to extract subsequence-based information, such as domains and motifs, and feed this information into a neural network to generate high-quality features. For the PPI network, the DeepWalk algorithm is applied to generate node embeddings. The sequence features and the PPI features are then concatenated to predict protein functions. We evaluate DeepGOA with several different evaluation methods and metrics. The experimental results show that DeepGOA outperforms DeepGO and BLAST.
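The two embedding inputs described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation; it only shows, under stated assumptions, how a protein sequence might be turned into "words" for word2vec and how truncated random walks, the first stage of DeepWalk, might be sampled from a PPI graph. The k-mer length of 3, the walk parameters, the function names `kmer_tokens` and `random_walks`, and the toy sequence and graph are all hypothetical choices made for illustration; the abstract does not specify any of them.

```python
import random


def kmer_tokens(sequence, k=3):
    """Split a protein sequence into overlapping k-mers.

    Assumption: overlapping k-mers act as the "words" given to word2vec.
    The abstract does not state how sequences are tokenized.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def random_walks(adjacency, walk_length=5, walks_per_node=2, seed=0):
    """Sample truncated random walks from a graph given as an adjacency dict.

    This covers only the walk-sampling stage of DeepWalk. The walks would
    then be treated as sentences and passed to a skip-gram model to obtain
    node embeddings; that stage is not shown here.
    """
    rng = random.Random(seed)
    nodes = sorted(adjacency)
    walks = []
    for _ in range(walks_per_node):
        rng.shuffle(nodes)  # visit start nodes in a fresh random order
        for start in nodes:
            walk = [start]
            while len(walk) < walk_length:
                neighbors = adjacency[walk[-1]]
                if not neighbors:  # isolated node: the walk stops early
                    break
                walk.append(rng.choice(neighbors))
            walks.append(walk)
    return walks


# Hypothetical toy inputs, not taken from the paper.
sequence = "MKTAYIAK"
ppi = {"P1": ["P2", "P3"], "P2": ["P1"], "P3": ["P1"], "P4": []}

tokens = kmer_tokens(sequence)
walks = random_walks(ppi)
print(tokens)
print(walks)
```

In a full pipeline along the lines the abstract describes, the sequence-derived feature vector and the PPI embedding of the same protein would be concatenated before the final classifier. The token lists and walks above are only the raw inputs to the two embedding models.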