Jothi Raja, Zotenko Elena, Tasneem Asba, Przytycka Teresa M
National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA.
Bioinformatics. 2006 Apr 1;22(7):779-88. doi: 10.1093/bioinformatics/btl009. Epub 2006 Jan 24.
Determining orthology relations among genes across multiple genomes is an important problem in the post-genomic era. Identifying orthologous genes can help predict functional annotations for newly sequenced or poorly characterized genomes, and can also help predict new protein-protein interactions. Unfortunately, determining orthology relations computationally is not straightforward because of the presence of paralogs. Traditional approaches have relied on pairwise sequence comparisons to construct graphs, which were then partitioned into putative clusters of orthologous groups. These methods do not attempt to preserve the non-transitivity and hierarchical nature of the orthology relation.
We propose a new method, COCO-CL, for hierarchical clustering of homology relations and identification of orthologous groups of genes. Unlike previous approaches, which are based on pairwise sequence comparisons, our method explores the correlation of evolutionary histories of individual genes in a more global context. COCO-CL can be used as a semi-independent method to delineate the orthology/paralogy relation for a refined set of homologous proteins obtained using a less-conservative clustering approach, or as a refiner that removes putative out-paralogs from clusters computed using a more inclusive approach. We analyze our clustering results manually, with support from the literature and functional annotations. Because our orthology determination procedure does not employ a species tree to infer duplication events, it can be used when the species tree is unknown or uncertain.
jothi@mail.nih.gov, przytyck@mail.nih.gov
Supplementary materials are available at Bioinformatics online.
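The abstract describes the approach only at a high level, so the following is a minimal sketch of the general idea rather than the COCO-CL procedure itself. It assumes each gene's evolutionary history can be approximated by its row of distances to every other gene in the family, uses Pearson correlation between those rows as the similarity, and recursively splits a family in two with average linkage. The function names, the toy distance matrix, and the `threshold` stopping rule are all hypothetical choices made for illustration.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # a constant profile carries no correlation signal
    return cov / sqrt(vx * vy)


def correlation_matrix(dist):
    """Correlate every pair of genes by their distance profiles (rows of dist)."""
    n = len(dist)
    return [[pearson(dist[i], dist[j]) for j in range(n)] for i in range(n)]


def mean_corr(group_a, group_b, corr):
    """Average correlation between members of two groups."""
    total = sum(corr[i][j] for i in group_a for j in group_b)
    return total / (len(group_a) * len(group_b))


def split_in_two(members, corr):
    """Average-linkage agglomeration, stopped when two groups remain."""
    clusters = [[m] for m in members]
    while len(clusters) > 2:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                score = mean_corr(clusters[a], clusters[b], corr)
                if best is None or score > best[0]:
                    best = (score, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


def orthologous_groups(members, corr, threshold=0.5):
    """Recursively bipartition; keep a set whole once its halves correlate well.

    The threshold is an assumed stopping rule, not one taken from the paper.
    """
    if len(members) < 2:
        return [list(members)]
    left, right = split_in_two(members, corr)
    if mean_corr(left, right, corr) >= threshold:
        return [sorted(members)]
    return (orthologous_groups(left, corr, threshold)
            + orthologous_groups(right, corr, threshold))


if __name__ == "__main__":
    # Toy family: genes 0 and 1 are close, genes 2 and 3 are close,
    # and the two pairs are far apart (as after an ancient duplication).
    dist = [
        [0, 1, 9, 9],
        [1, 0, 9, 9],
        [9, 9, 0, 1],
        [9, 9, 1, 0],
    ]
    corr = correlation_matrix(dist)
    print(orthologous_groups(list(range(4)), corr))
```

On the toy matrix, the two tight pairs have nearly identical distance profiles and strongly anti-correlated profiles across pairs, so the sketch separates them into two groups. Note that no species tree is consulted at any point, which mirrors the property the abstract highlights.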