Brown Duncan P, Krishnamurthy Nandini, Sjölander Kimmen
Department of Bioengineering, University of California, Berkeley, California, United States of America.
PLoS Comput Biol. 2007 Aug;3(8):e160. doi: 10.1371/journal.pcbi.0030160.
Function prediction by homology is widely used to provide preliminary functional annotations for genes for which experimental evidence of function is unavailable or limited. This approach has been shown to be prone to systematic error, including percolation of annotation errors through sequence databases. Phylogenomic analysis avoids these errors in function prediction but has been difficult to automate for high-throughput application. To address this limitation, we present a computationally efficient pipeline for phylogenomic classification of proteins. This pipeline uses the SCI-PHY (Subfamily Classification in Phylogenomics) algorithm for automatic subfamily identification, followed by subfamily hidden Markov model (HMM) construction. A simple and computationally efficient scoring scheme using family and subfamily HMMs enables classification of novel sequences to protein families and subfamilies. Sequences representing entirely novel subfamilies are differentiated from those that can be classified to subfamilies in the input training set using logistic regression. Subfamily HMM parameters are estimated using an information-sharing protocol, enabling subfamilies containing even a single sequence to benefit from conservation patterns defining the family as a whole or in related subfamilies. SCI-PHY subfamilies correspond closely to functional subtypes defined by experts and to conserved clades found by phylogenetic analysis. Extensive comparisons of subfamily and family HMM performances show that subfamily HMMs dramatically improve the separation between homologous and non-homologous proteins in sequence database searches. Subfamily HMMs also provide extremely high specificity of classification and can be used to predict entirely novel subtypes. The SCI-PHY Web server at http://phylogenomics.berkeley.edu/SCI-PHY/ allows users to upload a multiple sequence alignment for subfamily identification and subfamily HMM construction. Biologists wishing to provide their own subfamily definitions can do so. Source code is available on the Web page. The Berkeley Phylogenomics Group PhyloFacts resource contains pre-calculated subfamily predictions and subfamily HMMs for more than 40,000 protein families and domains at http://phylogenomics.berkeley.edu/phylofacts/.
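The classification step described above (score a sequence against each subfamily HMM, then use logistic regression to decide whether it belongs to a known subfamily or represents a novel one) can be illustrated with a minimal sketch. This is not the SCI-PHY implementation. The function name `classify`, the two features (top score and margin over the runner-up), the coefficient values, and the 0.5 threshold are all hypothetical choices made for illustration, and the per-subfamily scores are assumed to have been computed already by some HMM scoring tool.

```python
import math


def classify(subfamily_scores, weights=(-4.0, 0.05, 0.10), threshold=0.5):
    """Assign a sequence to its top-scoring subfamily or flag it as novel.

    subfamily_scores: dict mapping subfamily name -> HMM score for one
        query sequence (assumed precomputed; higher means a better match).
    weights: hypothetical logistic-regression coefficients
        (intercept, weight on top score, weight on score margin).
    threshold: hypothetical probability cutoff for "known subfamily".
    """
    if not subfamily_scores:
        raise ValueError("at least one subfamily score is required")

    # Rank subfamilies from best to worst score.
    ranked = sorted(subfamily_scores.items(), key=lambda kv: kv[1], reverse=True)
    best_name, best_score = ranked[0]

    # Margin between the best and second-best subfamily (0 if only one).
    runner_up = ranked[1][1] if len(ranked) > 1 else best_score
    margin = best_score - runner_up

    # Logistic model: probability that the sequence belongs to a
    # subfamily present in the training set.
    b0, b1, b2 = weights
    z = b0 + b1 * best_score + b2 * margin
    if z >= 0:
        p_known = 1.0 / (1.0 + math.exp(-z))
    else:
        # Numerically stable form for large negative z.
        e = math.exp(z)
        p_known = e / (1.0 + e)

    label = best_name if p_known >= threshold else "novel"
    return label, p_known


if __name__ == "__main__":
    # A strong, well-separated hit is assigned to its subfamily.
    print(classify({"subfamily_A": 120.0, "subfamily_B": 40.0}))
    # A weak, ambiguous hit is flagged as a possible novel subfamily.
    print(classify({"subfamily_A": 20.0, "subfamily_B": 18.0}))
```

In practice the coefficients would be fitted on labeled training data rather than set by hand, and the choice of features would depend on the scoring scheme actually used.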