Chen Xin, Zheng Jie, Fu Zheng, Nan Peng, Zhong Yang, Lonardi Stefano, Jiang Tao
Department of Computer Science and Engineering, University of California, Riverside, CA 92521, USA.
IEEE/ACM Trans Comput Biol Bioinform. 2005 Oct-Dec;2(4):302-15. doi: 10.1109/TCBB.2005.48.
The assignment of orthologous genes between a pair of genomes is a fundamental and challenging problem in comparative genomics. Existing methods that assign orthologs based on the similarity between DNA or protein sequences may make erroneous assignments when sequence similarity does not clearly delineate the evolutionary relationship among genes of the same families. In this paper, we present a new approach to ortholog assignment that takes into account both sequence similarity and evolutionary events at a genome level, where orthologous genes are assumed to correspond to each other in the most parsimonious evolving scenario under genome rearrangement. First, the problem is formulated as that of computing the signed reversal distance with duplicates between the two genomes of interest. Then, the problem is decomposed into two new optimization problems, called minimum common partition and maximum cycle decomposition, for which efficient heuristic algorithms are given. Following this approach, we have implemented a high-throughput system for assigning orthologs on a genome scale, called SOAR, and tested it on both simulated data and real genome sequence data. Compared to a recent ortholog assignment method based entirely on homology search (called INPARANOID), SOAR shows a marginally better performance in terms of sensitivity on the real data set because it is able to identify several correct orthologous pairs that are missed by INPARANOID. The simulation results demonstrate that SOAR, in general, performs better than the iterated exemplar algorithm in terms of computing the reversal distance and assigning correct orthologs.
在一对基因组之间确定直系同源基因是比较基因组学中的一个基本且具有挑战性的问题。现有的基于DNA或蛋白质序列相似性来确定直系同源基因的方法,当序列相似性无法清晰界定同一家族基因间的进化关系时,可能会做出错误的分配。在本文中,我们提出了一种新的直系同源基因分配方法,该方法在基因组层面同时考虑了序列相似性和进化事件,其中直系同源基因被假定在基因组重排的最简约进化场景中相互对应。首先,将该问题表述为计算两个目标基因组之间带重复的有符号反转距离的问题。然后,将该问题分解为两个新的优化问题,即最小公共划分和最大循环分解,并针对这两个问题给出了高效的启发式算法。按照这种方法,我们实现了一个用于在基因组规模上分配直系同源基因的高通量系统,称为SOAR,并在模拟数据和真实基因组序列数据上对其进行了测试。与最近一种完全基于同源性搜索的直系同源基因分配方法(称为INPARANOID)相比,SOAR在真实数据集上的敏感性方面表现略好,因为它能够识别出INPARANOID遗漏的几个正确的直系同源对。模拟结果表明,总体而言,在计算反转距离和分配正确的直系同源基因方面,SOAR比迭代范例算法表现更好。