MSOAR: a high-throughput ortholog assignment system based on genome rearrangement.

Author information

Fu Zheng, Chen Xin, Vacic Vladimir, Nan Peng, Zhong Yang, Jiang Tao

Affiliation

Department of Computer Science and Engineering, University of California, Riverside, California 92521, USA.

Publication information

J Comput Biol. 2007 Nov;14(9):1160-75. doi: 10.1089/cmb.2007.0048.

DOI:10.1089/cmb.2007.0048

PMID:17990975

Abstract

The assignment of orthologous genes between a pair of genomes is a fundamental and challenging problem in comparative genomics, since many computational methods for solving various biological problems critically rely on bona fide orthologs as input. While ortholog assignment is usually done using sequence similarity search, we recently proposed a new combinatorial approach that combines sequence similarity and genome rearrangement. This paper continues the development of that approach and unites genome rearrangement events and (post-speciation) duplication events in a single framework under the parsimony principle. In this framework, orthologous genes are assumed to correspond to each other in the most parsimonious evolutionary scenario involving both genome rearrangement and (post-speciation) gene duplication. Besides several original algorithmic contributions, the enhanced method allows for the detection of inparalogs. Following this approach, we have implemented a high-throughput system for genome-scale ortholog assignment, called MSOAR, and applied it to the human and mouse genomes. As the results show, MSOAR is able to find 99 more true orthologs than the INPARANOID program. In comparison to the iterated exemplar algorithm on simulated data, MSOAR performed favorably in terms of assignment accuracy. We also validated our predicted main ortholog pairs between human and mouse using public ortholog assignment datasets, synteny information, and gene function classification. These test results indicate that our approach is very promising for genome-wide ortholog assignment. Supplemental material and the MSOAR program are available at http://msoar.cs.ucr.edu.
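The parsimony idea in the abstract can be illustrated with a toy sketch. This is not MSOAR's algorithm: it assumes a single unsigned linear chromosome per genome, uses a simple breakpoint count as a crude stand-in for rearrangement distance, charges one unit for every gene copy left unpaired (treated here as a duplication), and enumerates all pairings by brute force, which is only feasible for tiny inputs. The names `breakpoints`, `pairings`, and `assign_orthologs` are hypothetical.

```python
from itertools import permutations, product


def breakpoints(order_a, order_b):
    """Count adjacencies in order_a that are not adjacent (either direction) in order_b."""
    adj_b = {frozenset(p) for p in zip(order_b, order_b[1:])}
    return sum(frozenset(p) not in adj_b for p in zip(order_a, order_a[1:]))


def pairings(a, b):
    """All one-to-one pairings of the smaller position list into the larger one."""
    if len(a) <= len(b):
        return [list(zip(a, pb)) for pb in permutations(b, len(a))]
    return [list(zip(pa, b)) for pa in permutations(a, len(b))]


def assign_orthologs(genome_a, genome_b):
    """Return (cost, pairs) minimizing breakpoints + unpaired copies.

    genome_a, genome_b: lists of gene-family labels in chromosomal order.
    pairs: sorted list of (index_in_a, index_in_b) for the chosen assignment.
    """
    families = sorted(set(genome_a) & set(genome_b))
    options = []
    for fam in families:
        pos_a = [i for i, g in enumerate(genome_a) if g == fam]
        pos_b = [j for j, g in enumerate(genome_b) if g == fam]
        options.append(pairings(pos_a, pos_b))

    best = None
    for choice in product(*options):
        pairs = [p for fam_pairs in choice for p in fam_pairs]
        ids = range(len(pairs))
        # Order the paired genes as they appear along each genome.
        order_a = sorted(ids, key=lambda k: pairs[k][0])
        order_b = sorted(ids, key=lambda k: pairs[k][1])
        unpaired = len(genome_a) + len(genome_b) - 2 * len(pairs)
        cost = breakpoints(order_a, order_b) + unpaired
        if best is None or cost < best[0]:
            best = (cost, sorted(pairs))
    return best


# Family "g" has one copy in genome A and two copies in genome B.
genome_a = ["x", "g", "y", "z"]
genome_b = ["g", "x", "g", "y", "z"]
cost, pairs = assign_orthologs(genome_a, genome_b)
print(cost, pairs)
```

In this example, pairing the single `g` of genome A with the second `g` of genome B preserves the gene order, so that pairing is preferred and the first `g` of genome B is left over as the extra copy. A sequence-similarity search alone could not distinguish the two copies if they were equally similar.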
