Department of Computer Science, University of California, Riverside, CA 92521, USA.
BMC Bioinformatics. 2010 Jan 6;11:10. doi: 10.1186/1471-2105-11-10.
Ortholog assignment is a critical and fundamental problem in comparative genomics, since orthologs are considered to be functional counterparts in different species and can be used to infer molecular functions of one species from those of other species. MSOAR is a recently developed high-throughput system for assigning one-to-one orthologs between closely related species on a genome scale. It attempts to reconstruct the evolutionary history of input genomes in terms of genome rearrangement and gene duplication events. It assumes that a gene duplication event inserts a duplicated gene into the genome of interest at a random location (i.e., the random duplication model). However, in practice, biologists believe that genes are often duplicated by tandem duplications, where a duplicated gene is located next to the original copy (i.e., the tandem duplication model).
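The difference between the two duplication models can be illustrated with a minimal sketch. It assumes a genome is simply an ordered list of gene identifiers; the function names and the trailing `'` used to mark a copy are hypothetical and are not taken from MSOAR.

```python
import random
from typing import List


def duplicate_random(genome: List[str], index: int, rng: random.Random) -> List[str]:
    """Random duplication model: the copy is inserted at a random position."""
    copy = genome[index] + "'"            # hypothetical label for the duplicate
    pos = rng.randrange(len(genome) + 1)  # any of the len + 1 insertion slots
    return genome[:pos] + [copy] + genome[pos:]


def duplicate_tandem(genome: List[str], index: int) -> List[str]:
    """Tandem duplication model: the copy is placed right after the original."""
    copy = genome[index] + "'"
    return genome[:index + 1] + [copy] + genome[index + 1:]


genome = ["g1", "g2", "g3"]
tandem = duplicate_tandem(genome, 1)
scattered = duplicate_random(genome, 1, random.Random(0))
print(tandem)
print(scattered)
```

Under the tandem model the copy is always adjacent to its source gene, whereas under the random model it may land anywhere in the gene order.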
In this paper, we develop MSOAR 2.0, an improved system for one-to-one ortholog assignment. For a pair of input genomes, the system first examines the tandemly duplicated genes of each genome and tries to identify those that were duplicated after speciation (i.e., the so-called inparalogs), using a simple phylogenetic tree reconciliation method. For each such set of tandemly duplicated inparalogs, all but one gene are deleted from the genome concerned (because the deleted copies cannot appear in any one-to-one ortholog pair), and MSOAR is then invoked. Using both simulated and real data experiments, we show that MSOAR 2.0 achieves better sensitivity and specificity than MSOAR. Compared with the well-known genome-scale ortholog assignment tool InParanoid, the Ensembl ortholog database, and the orthology information extracted from the well-known whole-genome multiple alignment program MultiZ, MSOAR 2.0 shows the highest sensitivity. Although the specificity of MSOAR 2.0 is slightly worse than that of InParanoid in the real data experiments, it is better than that of InParanoid in the simulation tests.
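The preprocessing step described above can be sketched as follows. This is only an illustration under stated assumptions, not the MSOAR 2.0 implementation: the inparalog groups are taken as given (in the paper they come from phylogenetic tree reconciliation, which is not reproduced here), a genome is an ordered list of gene identifiers, and the first gene of each tandem run is the one kept, a choice the abstract does not specify. All names are hypothetical.

```python
from typing import Dict, Hashable, List


def collapse_tandem_inparalogs(
    genome: List[str],
    inparalog_group: Dict[str, Hashable],
) -> List[str]:
    """Keep one gene from each run of adjacent genes in the same inparalog group.

    `inparalog_group` maps a gene to the identifier of its inparalog group;
    genes missing from the map are treated as having no inparalogs.
    """
    kept: List[str] = []
    prev_group = None
    for gene in genome:
        group = inparalog_group.get(gene)
        # Drop a gene only when it directly follows a member of its own group.
        if group is not None and group == prev_group:
            continue
        kept.append(gene)
        prev_group = group
    return kept


genome = ["a1", "b1", "b2", "b3", "c1", "b4"]
groups = {"b1": "B", "b2": "B", "b3": "B", "b4": "B"}
reduced = collapse_tandem_inparalogs(genome, groups)
print(reduced)
```

In this example `b2` and `b3` are removed because they sit next to `b1`, while `b4` is retained: it belongs to the same group but is not adjacent to the run, so it is not a tandem duplicate in the sense used here. The reduced gene order would then be passed on to the downstream ortholog assignment step.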
Our preliminary experimental results demonstrate that MSOAR 2.0 is a highly accurate tool for one-to-one ortholog assignment between closely related genomes. The software is freely available to the public and is included as online supplementary material.