Department of Computer Science and Engineering, Seoul National University, Seoul, Korea.
Interdisciplinary Program in Bioinformatics, Seoul National University, Seoul, Korea.
Bioinformatics. 2018 Jul 1;34(13):i254-i262. doi: 10.1093/bioinformatics/bty275.
Next-generation sequencing technologies generate a large number of newly sequenced proteins, and assigning biochemical functions to these proteins is an important task. However, biological experiments are too expensive to characterize so many protein sequences, so protein function prediction is primarily done by computational modeling methods, such as profile Hidden Markov Models (pHMMs) and k-mer based methods. Nevertheless, existing methods have limitations: k-mer based methods are not accurate enough to assign protein functions, and pHMMs are not fast enough to handle the large number of protein sequences from numerous genome projects. Therefore, a more accurate and faster protein function prediction method is needed.
In this paper, we introduce DeepFam, an alignment-free method that extracts functional information directly from sequences without the need for multiple sequence alignments. In extensive experiments using the Clusters of Orthologous Groups (COGs) and G protein-coupled receptor (GPCR) datasets, DeepFam achieved better accuracy and runtime for predicting protein functions than state-of-the-art methods, both alignment-free and alignment-based. Additionally, we showed that DeepFam can capture conserved regions to model protein families. In fact, DeepFam was able to detect conserved regions documented in the Prosite database while predicting protein functions. Our deep learning method will be useful in characterizing the functions of the ever-increasing number of protein sequences.
Code is available at https://bhi-kimlab.github.io/DeepFam.
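For readers unfamiliar with the k-mer based baselines mentioned above, the following is a minimal sketch of how a protein sequence can be turned into a fixed-length k-mer frequency vector without any alignment. It is not DeepFam's method and not taken from the DeepFam code; the function name `kmer_profile`, the 20-letter alphabet constant, and the example sequence are assumptions made for illustration only.

```python
from collections import Counter
from itertools import product

# The 20 standard amino acids (assumed alphabet for this illustration).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def kmer_profile(sequence, k=2):
    """Return the normalized frequency of every possible k-mer in `sequence`.

    The vector has len(AMINO_ACIDS) ** k entries, ordered lexicographically
    over AMINO_ACIDS. k-mers containing non-standard residues are ignored.
    """
    vocab = ["".join(p) for p in product(AMINO_ACIDS, repeat=k)]
    # Slide a window of width k over the sequence and count each k-mer.
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    total = sum(counts[m] for m in vocab)
    return [counts[m] / total if total else 0.0 for m in vocab]


# Hypothetical toy sequence, used only to show the call.
profile = kmer_profile("MKTAYIAKQR", k=2)
```

A vector like this can be passed to any standard classifier. Because it discards the position of each k-mer within the sequence, it is fast to compute but carries limited information, which is in line with the accuracy limitation of k-mer based methods noted in the abstract.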