Krishnamurthy Nandini, Brown Duncan, Sjölander Kimmen
Department of BioEngineering, 473 Evans Hall #1762, University of California, Berkeley, CA 94720-1762, USA.
BMC Evol Biol. 2007 Feb 8;7 Suppl 1(Suppl 1):S12. doi: 10.1186/1471-2148-7-S1-S12.
Function prediction by transfer of annotation from the top database hit in a homology search has been shown to be prone to systematic error. Phylogenomic analysis reduces these errors by inferring protein function within the evolutionary context of the entire family. However, accuracy of function prediction for multi-domain proteins depends on all members having the same overall domain structure. By contrast, most common homolog detection methods are optimized for retrieving local homologs, and do not address this requirement.
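The local-versus-global distinction can be illustrated with a simple coverage test. The abstract does not state how FlowerPower decides that a hit is a global homolog, so the sketch below is only an assumption: it treats a hit as "global" when the alignment covers most of both the query and the hit. The function names and the 0.7 threshold are hypothetical.

```python
def coverage(aligned_residues: int, sequence_length: int) -> float:
    """Fraction of a sequence that falls inside the aligned region."""
    if sequence_length <= 0:
        raise ValueError("sequence_length must be positive")
    return aligned_residues / sequence_length


def looks_global(query_len: int, hit_len: int,
                 aligned_query: int, aligned_hit: int,
                 min_coverage: float = 0.7) -> bool:
    """Hypothetical criterion: both sequences must be mostly covered.

    A local hit (for example, one shared domain inside a much longer
    protein) covers the query well but only a small part of the hit,
    so it fails the second condition.
    """
    return (coverage(aligned_query, query_len) >= min_coverage
            and coverage(aligned_hit, hit_len) >= min_coverage)
```

Under these assumptions, a 300-residue query aligned over 280 residues to a 310-residue hit passes, while the same alignment to a 900-residue hit fails because most of the hit lies outside the alignment.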
We present FlowerPower, a novel clustering algorithm designed for the identification of global homologs as a precursor to structural phylogenomic analysis. Like methods such as PSI-BLAST, FlowerPower clusters sequences iteratively. However, rather than using a single HMM or profile to expand the cluster, FlowerPower identifies subfamilies using the SCI-PHY algorithm and then selects and aligns new homologs using subfamily hidden Markov models. FlowerPower is shown to outperform BLAST, PSI-BLAST and the UCSC SAM-Target 2K methods at discriminating between proteins in the same domain architecture class and those having different overall domain structures.
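The control flow of such an iterative, subfamily-driven expansion can be sketched as follows. This is a toy illustration, not the FlowerPower implementation: k-mer sets stand in for subfamily HMMs, a greedy grouping stands in for SCI-PHY, and every function name and threshold is hypothetical. It also does not enforce global alignment; it only shows the loop of splitting the cluster into subfamilies, building one model per subfamily, and recruiting new sequences until none qualify.

```python
from typing import List, Set


def kmers(seq: str, k: int = 3) -> Set[str]:
    """All overlapping k-mers of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity of two k-mer sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def split_into_subfamilies(cluster: List[str], threshold: float) -> List[List[str]]:
    """Toy stand-in for SCI-PHY: greedy grouping by similarity to each group's first member."""
    subfamilies: List[List[str]] = []
    for seq in cluster:
        for group in subfamilies:
            if jaccard(kmers(seq), kmers(group[0])) >= threshold:
                group.append(seq)
                break
        else:
            subfamilies.append([seq])
    return subfamilies


def build_model(subfamily: List[str]) -> Set[str]:
    """Toy stand-in for a subfamily HMM: the union of the members' k-mers."""
    model: Set[str] = set()
    for seq in subfamily:
        model |= kmers(seq)
    return model


def score(model: Set[str], seq: str) -> float:
    """Fraction of the candidate's k-mers that the model contains."""
    candidate = kmers(seq)
    return len(candidate & model) / len(candidate) if candidate else 0.0


def expand_cluster(query: str, database: List[str], accept: float = 0.7,
                   subfamily_threshold: float = 0.5, max_iterations: int = 10) -> List[str]:
    """Grow a cluster from the query until no new sequence is accepted."""
    cluster = [query]
    for _ in range(max_iterations):
        models = [build_model(s)
                  for s in split_into_subfamilies(cluster, subfamily_threshold)]
        recruited = [s for s in database
                     if s not in cluster
                     and any(score(m, s) >= accept for m in models)]
        if not recruited:
            break  # converged: no subfamily model accepts anything new
        cluster.extend(recruited)
    return cluster
```

With the query `"ABCDEFGH"` and the database `["ABCDEFGX", "BCDEFGXY", "ZZZZZZZZ"]`, the first pass recruits only `"ABCDEFGX"`. The model rebuilt in the second pass then accepts `"BCDEFGXY"`, which the query alone had rejected, and the unrelated sequence is never recruited.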
Structural phylogenomic analysis enables biologists to avoid the systematic errors associated with annotation transfer; clustering sequences based on sharing the same domain architecture is a critical first step in this process. FlowerPower is shown to consistently identify homologous sequences having the same domain architecture as the query.
FlowerPower is available as a webserver at http://phylogenomics.berkeley.edu/flowerpower/.