Department of Molecular, Cell and Developmental Biology, University of California at Los Angeles, Los Angeles, CA 90095, USA.
Nucleic Acids Res. 2012 Oct;40(19):e152. doi: 10.1093/nar/gks631. Epub 2012 Jul 11.
We have developed GFam, a platform for automatic annotation of gene/protein families. GFam provides a framework for genome initiatives and model organism resources to build domain-based families, derive meaningful functional labels and propagate functional annotation seamlessly across periodic genome updates. GFam is a hybrid approach. It first uses a greedy algorithm to chain component domains from the InterPro annotation provided by its 12 member resources, then applies a sequence-based connected component analysis to the un-annotated sequence regions. Together these steps yield a consensus domain architecture for each sequence, and families are subsequently generated from common architectures. For the proteome of Arabidopsis, our integrated approach increases sequence coverage by 7.2 percentage points and residue coverage by 14.6 percentage points relative to the best single-constituent database within InterPro. The true power of GFam lies in maximizing the annotation provided by the different InterPro data sources, each of which offers resource-specific coverage for different regions of a sequence. GFam's capability to capture higher sequence and residue coverage can be useful for genome annotation, comparative genomics and functional studies. GFam is general-purpose software and can be used for any collection of protein sequences. The software is open source and can be obtained from http://www.paccanarolab.org/software/gfam/.
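The greedy chaining and architecture-based grouping described in the abstract can be illustrated with a minimal Python sketch. This is not GFam's implementation: the `Hit` record, the "longest hit first, reject any overlap" rule, the function names and the toy coordinates are all assumptions made for illustration, and the connected component analysis of un-annotated regions is omitted entirely.

```python
from collections import defaultdict
from typing import NamedTuple


class Hit(NamedTuple):
    """A domain hit on one sequence (hypothetical record, not GFam's format)."""
    domain: str  # illustrative domain identifier
    start: int   # 1-based inclusive start
    end: int     # 1-based inclusive end


def greedy_architecture(hits):
    """Choose non-overlapping hits, longest first; return them ordered by start.

    Assumption: any overlap disqualifies a hit. A real pipeline may tolerate
    small overlaps or rank the source databases differently.
    """
    chosen = []
    by_length = sorted(hits, key=lambda h: (-(h.end - h.start + 1), h.start))
    for hit in by_length:
        if all(hit.end < c.start or hit.start > c.end for c in chosen):
            chosen.append(hit)
    return sorted(chosen, key=lambda h: h.start)


def residue_coverage(chosen, length):
    """Fraction of residues covered by the chosen (non-overlapping) hits."""
    return sum(h.end - h.start + 1 for h in chosen) / length


def families_by_architecture(hits_by_seq):
    """Group sequence identifiers that share the same ordered domain list."""
    families = defaultdict(list)
    for seq_id, hits in hits_by_seq.items():
        arch = tuple(h.domain for h in greedy_architecture(hits))
        families[arch].append(seq_id)
    return dict(families)


# Toy input: made-up identifiers and coordinates, not real annotation.
hits_by_seq = {
    "seqA": [Hit("D1", 1, 100), Hit("D2", 90, 120), Hit("D3", 130, 200)],
    "seqB": [Hit("D1", 5, 110), Hit("D3", 150, 210)],
    "seqC": [Hit("D2", 10, 60)],
}
families = families_by_architecture(hits_by_seq)
for arch, members in families.items():
    print(arch, members)
```

In this toy input, `seqA` and `seqB` fall into the same family because both reduce to the architecture `("D1", "D3")`; the `D2` hit on `seqA` is dropped because it overlaps the longer `D1` hit. Combining hits from several sources in this way is what lets the chosen set cover more residues than any single source would on its own.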