Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ), Shenzhen University, Guangdong Province, PR China.
Comput Biol Chem. 2020 Dec;89:107379. doi: 10.1016/j.compbiolchem.2020.107379. Epub 2020 Sep 23.
With the application of new high-throughput sequencing technology, a large number of protein sequences are becoming available. Determining the functional characteristics of these proteins experimentally is expensive and time-consuming. Furthermore, at the organismal level, such experimental functional analyses can be conducted for only a few selected model organisms. Computational function prediction methods can be used to fill this gap. Protein functions are classified by the Gene Ontology (GO), which contains more than 40,000 classes in three domains: Molecular Function (MF), Biological Process (BP), and Cellular Component (CC). Additionally, since a protein can have many functions, function prediction is a multi-label, multi-class problem. We developed a new method to predict protein function from sequence. To this end, a natural language model is used to generate word embeddings of the sequence, from which features are learned by deep learning, together with additional features to locate every protein. Our method uses the dependencies between GO classes as background information to construct a deep learning model. We evaluate our method using the standards established by the Critical Assessment of Functional Annotation (CAFA) and observe a noticeable improvement over several algorithms, such as FFPred, DeepGO, GoFDR and other methods compared on the CAFA3 datasets.
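Two ideas in the abstract can be illustrated with a small Python sketch: treating a protein sequence as a sentence of "words", and using the dependencies between GO classes when building multi-label targets. The abstract does not say how words are formed or how the dependencies enter the model, so the sketch makes two assumptions: words are overlapping 3-mers, and dependencies are applied by propagating each annotation to all of its ancestors. The names `kmer_tokens`, `propagate`, `label_vector`, and `PARENTS` are hypothetical, and the ontology is a four-term toy fragment of Molecular Function rather than a full GO release.

```python
from typing import Dict, List, Set

# Toy fragment of the Molecular Function ontology (child -> parents).
# Illustrative only; a real run would parse a full GO release.
PARENTS: Dict[str, Set[str]] = {
    "GO:0005515": {"GO:0005488"},  # protein binding -> binding
    "GO:0005488": {"GO:0003674"},  # binding -> molecular_function
    "GO:0003824": {"GO:0003674"},  # catalytic activity -> molecular_function
    "GO:0003674": set(),           # molecular_function (root)
}


def kmer_tokens(sequence: str, k: int = 3) -> List[str]:
    """Split a protein sequence into overlapping k-mers (assumed 'words')."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def propagate(terms: Set[str], parents: Dict[str, Set[str]]) -> Set[str]:
    """Close a set of GO terms under the ancestor relation."""
    closed: Set[str] = set()
    stack = list(terms)
    while stack:
        term = stack.pop()
        if term not in closed:
            closed.add(term)
            stack.extend(parents.get(term, ()))
    return closed


def label_vector(terms: Set[str], classes: List[str]) -> List[int]:
    """Encode a term set as a 0/1 multi-label target over a fixed class order."""
    return [1 if c in terms else 0 for c in classes]


CLASSES = sorted(PARENTS)                 # fixed output order for the model
tokens = kmer_tokens("MKTAYIAK")          # words to feed an embedding model
labels = label_vector(propagate({"GO:0005515"}, PARENTS), CLASSES)
```

A protein annotated only with protein binding ends up with positive labels for binding and the root as well, so the targets stay consistent with the hierarchy. The token list is what an embedding model would consume in place of natural-language words.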