Zheng Xiangqun H, Lu Fu, Wang Zhen-Yuan, Zhong Fei, Hoover Jeffrey, Mural Richard
Assays and Bioinformatics, Celera Genomics Corporation, 45 West Gude Drive, Rockville, MD 20850, USA.
Bioinformatics. 2005 Mar;21(6):703-10. doi: 10.1093/bioinformatics/bti045. Epub 2004 Sep 30.
The identification of orthologous gene pairs is generally based on sequence similarity. Gene pairs that are mutual 'best hits' between the genomes being compared are asserted to be orthologs. Although this method identifies most orthologous gene pairs with high confidence, it misses a fraction of them, especially genes in duplicated gene families. In addition, the approach depends heavily on the completeness and quality of gene annotation: when the gene sequences are not correctly represented, it is unlikely to find the correct ortholog. To overcome these limitations, we have developed an approach that identifies orthologous gene pairs using shared chromosomal synteny and the annotation of protein function.
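The mutual 'best hit' criterion described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the hit tuples, gene names, and scores are hypothetical, and a real pipeline would parse them from alignment output. The toy data also show how a duplicated gene family causes a true pair to be missed.

```python
from typing import Dict, Iterable, Set, Tuple

# Hypothetical hit record: (query gene, target gene, alignment score).
Hit = Tuple[str, str, float]


def best_hits(hits: Iterable[Hit]) -> Dict[str, str]:
    """Return, for each query gene, its single highest-scoring target."""
    best: Dict[str, Tuple[str, float]] = {}
    for query, target, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (target, score)
    return {query: target for query, (target, _) in best.items()}


def reciprocal_best_hits(
    human_vs_mouse: Iterable[Hit], mouse_vs_human: Iterable[Hit]
) -> Set[Tuple[str, str]]:
    """Keep (human, mouse) pairs that are each other's best hit."""
    h2m = best_hits(human_vs_mouse)
    m2h = best_hits(mouse_vs_human)
    return {(h, m) for h, m in h2m.items() if m2h.get(m) == h}


# Toy duplicated family: hA/hB in human, mA/mB in mouse (made-up scores).
human_vs_mouse = [("hA", "mA", 900.0), ("hA", "mB", 850.0),
                  ("hB", "mA", 880.0), ("hB", "mB", 700.0)]
mouse_vs_human = [("mA", "hA", 900.0), ("mA", "hB", 880.0),
                  ("mB", "hA", 850.0), ("mB", "hB", 700.0)]

pairs = reciprocal_best_hits(human_vs_mouse, mouse_vs_human)
print(pairs)
```

In this toy case only the hA/mA pair survives, because hB and mB each score best against a gene that prefers a different partner. This is the kind of pair that similarity alone leaves unresolved.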
Assembled mouse and human genomes were used to identify the regions of conserved synteny between them. 'Syntenic anchors' are conserved, non-repetitive locations shared by the mouse and human genomes. Using these anchors, we identified blocks of sequence in which the anchors are consistently ordered in both genomes (syntenic blocks). This synteny information was then used to help identify orthologous gene pairs between mouse and human. The approach combines the mutual selection of best tBlastX hits between human and mouse transcripts with the inference of orthologous relationships from three kinds of evidence: shared syntenic anchors, co-location in the same syntenic block, and the same annotated protein function. Using this approach, we found 19,357 orthologous gene pairs between the human and mouse genomes, a 20% increase over the number of orthologs identified by conventional approaches.
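The idea of grouping consistently ordered anchors into syntenic blocks can be sketched as follows. The abstract does not specify the actual block-building rules, so this is a simplified illustration under stated assumptions: the `Anchor` fields and coordinates are hypothetical, and a block is taken to be a run of anchors, sorted by human position, that stay on the same chromosome pair and move in one direction along the mouse chromosome.

```python
from typing import Iterable, List, NamedTuple


class Anchor(NamedTuple):
    """Hypothetical anchor: one conserved location in each genome."""
    human_chrom: str
    human_pos: int
    mouse_chrom: str
    mouse_pos: int


def syntenic_blocks(anchors: Iterable[Anchor]) -> List[List[Anchor]]:
    """Split anchors into runs with a consistent order in both genomes."""
    blocks: List[List[Anchor]] = []
    current: List[Anchor] = []
    direction = 0  # +1 ascending in mouse, -1 descending, 0 not yet known
    ordered = sorted(anchors, key=lambda a: (a.human_chrom, a.human_pos))
    for anchor in ordered:
        if current:
            prev = current[-1]
            same_chroms = (anchor.human_chrom == prev.human_chrom
                           and anchor.mouse_chrom == prev.mouse_chrom)
            # Sign of the move along the mouse chromosome.
            step = ((anchor.mouse_pos > prev.mouse_pos)
                    - (anchor.mouse_pos < prev.mouse_pos))
            if same_chroms and step != 0 and direction in (0, step):
                direction = step
                current.append(anchor)
                continue
            blocks.append(current)  # order broke: close the block
        current = [anchor]
        direction = 0
    if current:
        blocks.append(current)
    return blocks


# Made-up coordinates: a forward run, an inverted run, then a new chromosome.
anchors = [
    Anchor("chr1", 100, "chr4", 1000),
    Anchor("chr1", 200, "chr4", 2000),
    Anchor("chr1", 300, "chr4", 3000),
    Anchor("chr1", 400, "chr4", 2500),
    Anchor("chr1", 500, "chr4", 2400),
    Anchor("chr2", 100, "chr7", 50),
]
block_sizes = [len(block) for block in syntenic_blocks(anchors)]
print(block_sizes)
```

Under these assumptions, two genes that overlap anchors in the same block and carry the same annotated function would become candidate orthologs even when they are not mutual best hits. A production version would presumably also tolerate small local rearrangements and require a minimum number of anchors per block.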