Storm Christian E V, Sonnhammer Erik L L
Center for Genomics and Bioinformatics, Karolinska Institutet, S-17177 Stockholm, Sweden.
Genome Res. 2003 Oct;13(10):2353-62. doi: 10.1101/gr.1305203.
One of the most reliable methods for protein function annotation is to transfer experimentally known functions from orthologous proteins in other organisms. Most methods for identifying orthologs operate on a subset of organisms with a completely sequenced genome, and treat proteins as single-domain units. However, it is well known that proteins are often made up of several independent domains, and there is a wealth of protein sequences from genomes that are not completely sequenced. A comprehensive set of protein domain families is found in the Pfam database. We wanted to apply orthology detection to Pfam families, but first some issues needed to be addressed. First, orthology detection becomes impractical and unreliable when too many species are included. Second, shorter domains contain less information. It is therefore important to assess the quality of the orthology assignment and avoid very short domains altogether. We present a database of orthologous protein domains in Pfam called HOPS: Hierarchical grouping of Orthologous and Paralogous Sequences. Orthology is inferred in a hierarchical system of phylogenetic subgroups using ortholog bootstrapping. To avoid the frequent errors stemming from horizontally transferred genes in bacteria, the analysis is presently limited to eukaryotic genes. The results are accessible in the graphical browser NIFAS, a Java tool originally developed for analyzing phylogenetic relations within Pfam families. The method was tested on a set of curated orthologs with experimentally verified function. Compared with tree reconciliation using a complete species tree, our approach finds significantly more orthologs in the test set. Examples of using HOPS to investigate gene fusions and domain recombination are given.
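The idea of ortholog bootstrapping can be illustrated with a minimal sketch: for a pair of sequences, count the fraction of bootstrap trees in which the pair appears orthologous. The abstract does not specify the per-tree orthology test, so the sketch below substitutes a simple species-overlap rule (the last common ancestor counts as a speciation when its child subtrees share no species). The nested-tuple tree encoding, the `GENE_species` label convention, and all function names are assumptions made for illustration and are not taken from HOPS.

```python
from itertools import combinations


def leaves(tree):
    """Return all leaf labels under a nested-tuple tree (a leaf is a str)."""
    if isinstance(tree, str):
        return [tree]
    return [leaf for child in tree for leaf in leaves(child)]


def species(leaf):
    """Extract the species from a hypothetical 'GENE_species' label."""
    return leaf.rsplit("_", 1)[1]


def lca(tree, a, b):
    """Return the smallest subtree containing both distinct leaves a and b."""
    for child in tree:
        if not isinstance(child, str):
            found = leaves(child)
            if a in found and b in found:
                return lca(child, a, b)
    return tree


def is_ortholog_pair(tree, a, b):
    """Species-overlap rule (an assumed stand-in for the real criterion):
    a and b are called orthologs if the child subtrees of their last
    common ancestor share no species, i.e. the node looks like a speciation."""
    node = lca(tree, a, b)
    child_species = [{species(leaf) for leaf in leaves(c)} for c in node]
    return all(x.isdisjoint(y) for x, y in combinations(child_species, 2))


def orthology_support(bootstrap_trees, a, b):
    """Fraction of bootstrap trees in which a and b are called orthologs."""
    hits = sum(is_ortholog_pair(t, a, b) for t in bootstrap_trees)
    return hits / len(bootstrap_trees)


# Toy data: three replicates group each gene by subfamily, one does not.
consistent = (("A1_human", "A1_mouse"), ("A2_human", "A2_mouse"))
inconsistent = (("A1_human", ("A2_human", "A1_mouse")), "A2_mouse")
trees = [consistent] * 3 + [inconsistent]

print(orthology_support(trees, "A1_human", "A1_mouse"))  # 0.75
print(orthology_support(trees, "A1_human", "A2_human"))  # 0.0
```

A support value of this kind gives a way to assess the quality of an orthology assignment: pairs supported by few bootstrap trees, as may happen for short domains that carry little information, can be treated as unreliable.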