Suppr超能文献

一种用于从基因组间对称最佳匹配组装同源群簇的低多项式算法。

A low-polynomial algorithm for assembling clusters of orthologous groups from intergenomic symmetric best matches.

机构信息

Department of Binformatics, Stowers Institute for Medical Research, Kansas City, MO 64110, USA.

出版信息

Bioinformatics. 2010 Jun 15;26(12):1481-7. doi: 10.1093/bioinformatics/btq229. Epub 2010 May 2.

Abstract

MOTIVATION

Identifying orthologous genes in multiple genomes is a fundamental task in comparative genomics. Construction of intergenomic symmetrical best matches (SymBets) and joining them into clusters is a popular method of ortholog definition, embodied in several software programs. Despite their wide use, the computational complexity of these programs has not been thoroughly examined.

RESULTS

In this work, we show that in the standard approach of iteration through all triangles of SymBets, the memory scales with at least the number of these triangles, O(g(3)) (where g = number of genomes), and construction time scales with the iteration through each pair, i.e. O(g(6)). We propose the EdgeSearch algorithm that iterates over edges in the SymBet graph rather than triangles of SymBets, and as a result has a worst-case complexity of only O(g(3)log g). Several optimizations reduce the run-time even further in realistically sparse graphs. In two real-world datasets of genomes from bacteriophages (POGs) and Mollicutes (MOGs), an implementation of the EdgeSearch algorithm runs about an order of magnitude faster than the original algorithm and scales much better with increasing number of genomes, with only minor differences in the final results, and up to 60 times faster than the popular OrthoMCL program with a 90% overlap between the identified groups of orthologs.

AVAILABILITY AND IMPLEMENTATION

C++ source code freely available for download at ftp.ncbi.nih.gov/pub/wolf/COGs/COGsoft/.

SUPPLEMENTARY INFORMATION

Supplementary materials are available at Bioinformatics online.

摘要

动机

在比较基因组学中,识别多个基因组中的直系同源基因是一项基本任务。构建基因组间对称最佳匹配(SymBets)并将它们聚类是定义直系同源的一种流行方法,体现在几个软件程序中。尽管它们被广泛使用,但这些程序的计算复杂度尚未得到彻底研究。

结果

在这项工作中,我们表明在通过 SymBets 的所有三角形进行迭代的标准方法中,内存至少按这些三角形的数量扩展,即 O(g(3))(其中 g 是基因组的数量),并且构建时间按每对的迭代扩展,即 O(g(6))。我们提出了 EdgeSearch 算法,它在 SymBet 图的边而不是 SymBets 的三角形上迭代,因此最坏情况下的复杂度仅为 O(g(3)log g)。在现实稀疏图中,几个优化进一步减少了运行时间。在来自噬菌体(POGs)和 Mollicutes(MOGs)的基因组的两个真实数据集上,EdgeSearch 算法的实现速度比原始算法快一个数量级,并且随着基因组数量的增加,其扩展要好得多,最终结果只有微小差异,并且比流行的 OrthoMCL 程序快 60 倍,其中识别的直系同源物组之间有 90%的重叠。

可用性和实现

可在 ftp.ncbi.nih.gov/pub/wolf/COGs/COGsoft/ 下载 C++源代码。

补充信息

补充材料可在 Bioinformatics 在线获取。

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0356/2881409/3ee63044592a/btq229f1.jpg

文献检索

告别复杂PubMed语法,用中文像聊天一样搜索,搜遍4000万医学文献。AI智能推荐,让科研检索更轻松。

立即免费搜索

文件翻译

保留排版,准确专业,支持PDF/Word/PPT等文件格式,支持 12+语言互译。

免费翻译文档

深度研究

AI帮你快速写综述,25分钟生成高质量综述,智能提取关键信息,辅助科研写作。

立即免费体验