Department of Biochemistry and Molecular Biology, Monash University, VIC 3800, Australia.
Nucleic Acids Res. 2012 Mar;40(6):e44. doi: 10.1093/nar/gkr1261. Epub 2011 Dec 30.
Broadly, computational ortholog assignment is a three-step process: (i) identify all putative homologs between the genomes, (ii) identify gene anchors and (iii) link anchors to identify the best gene matches given their order and context. In this article, we engineer two methods to improve two important aspects of this pipeline [specifically steps (ii) and (iii)]. First, computing sequence similarity data [step (i)] is a computationally intensive task for large sequence sets, creating a bottleneck in the ortholog assignment pipeline. We have designed a fast and highly scalable sort-join method (afree), based on k-mer counts, that rapidly compares all pairs of sequences in a large protein sequence set to identify putative homologs. Second, the availability of complex genomes containing large gene families, with a prevalence of complex evolutionary events such as duplications, has made the task of assigning orthologs and co-orthologs difficult. Here, we have developed an iterative graph matching strategy in which the best gene assignments are identified at each iteration, resulting in a set of orthologs and co-orthologs. We find that the afree algorithm is faster than existing methods and maintains high accuracy in identifying similar genes. The iterative graph matching strategy also showed high accuracy in identifying complex gene relationships. Standalone afree is available from http://vbc.med.monash.edu.au/~kmahmood/afree. EGM2, the complete ortholog assignment pipeline (including afree and the iterative graph matching method), is available from http://vbc.med.monash.edu.au/~kmahmood/EGM2.
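As an illustration of the sort-join idea on k-mer counts, the following is a minimal sketch in Python. It is not the afree implementation: the function names (`kmers`, `shared_kmer_counts`, `putative_homologs`), the choice of k = 3 and the `min_shared` threshold are all assumptions made for the example. The sketch emits one (k-mer, sequence id) record per distinct k-mer, sorts the records so that identical k-mers become adjacent, and then joins each run of equal k-mers to count how many distinct k-mers every pair of sequences shares.

```python
from collections import defaultdict
from itertools import groupby


def kmers(seq, k):
    """Return the set of distinct k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def shared_kmer_counts(seqs, k=3):
    """Count shared distinct k-mers for each pair of sequences (hypothetical helper).

    seqs maps a sequence id to its protein sequence.
    """
    # One (k-mer, id) record per distinct k-mer in each sequence.
    records = [(kmer, sid) for sid, seq in seqs.items() for kmer in kmers(seq, k)]
    # Sort step: records with the same k-mer become adjacent, ids in order.
    records.sort()
    counts = defaultdict(int)
    # Join step: every run of equal k-mers contributes one count per id pair.
    for _, group in groupby(records, key=lambda r: r[0]):
        ids = [sid for _, sid in group]
        for i in range(len(ids)):
            for j in range(i + 1, len(ids)):
                counts[(ids[i], ids[j])] += 1
    return dict(counts)


def putative_homologs(seqs, k=3, min_shared=2):
    """Keep pairs sharing at least min_shared k-mers (threshold is an assumption)."""
    return {pair: n for pair, n in shared_kmer_counts(seqs, k).items() if n >= min_shared}
```

With the toy input `{"a": "MKVLAT", "b": "MKVLGG", "c": "PPPPPP"}`, sequences `a` and `b` share the 3-mers `MKV` and `KVL`, so the only candidate pair reported is `("a", "b")` with a count of 2. Sequence pairs that share no k-mer never appear in the join, which is what lets this style of comparison avoid an explicit all-against-all alignment.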