Murasaki: a fast, parallelizable algorithm to find anchors from multiple genomes.

Affiliation

Department of Biosciences and Informatics, Keio University, Yokohama, Japan.

Publication information

PLoS One. 2010 Sep 24;5(9):e12651. doi: 10.1371/journal.pone.0012651.

DOI:10.1371/journal.pone.0012651

PMID:20885980

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC2945767/

Abstract

BACKGROUND: With the number of available genome sequences increasing rapidly, the magnitude of sequence data required for multiple-genome analyses is a challenging problem. When large-scale rearrangements break the collinearity of gene orders among genomes, genome comparison algorithms must first identify sets of short well-conserved sequences present in each genome, termed anchors. Previously, anchor identification among multiple genomes has been achieved using pairwise alignment tools like BLASTZ through progressive alignment tools like TBA, but the computational requirements for sequence comparisons of multiple genomes quickly become a limiting factor as the number and scale of genomes grow.

METHODOLOGY/PRINCIPAL FINDINGS: Our algorithm, named Murasaki, makes it possible to identify anchors within multiple large sequences on the scale of several hundred megabases in a few minutes using a single CPU. Two advanced features of Murasaki are (1) adaptive hash function generation, which enables efficient use of arbitrary mismatch patterns (spaced seeds) and therefore the comparison of multiple mammalian genomes in a practical amount of computation time, and (2) parallelizable execution that decreases the required wall-clock and CPU times. Murasaki can perform a sensitive anchoring of eight mammalian genomes (human, chimp, rhesus, orangutan, mouse, rat, dog, and cow) in 21 hours CPU time (42 minutes wall time). This is the first single-pass in-core anchoring of multiple mammalian genomes. We evaluated Murasaki by comparing it with the genome alignment programs BLASTZ and TBA. We show that Murasaki can anchor multiple genomes in near-linear time, compared to the quadratic time requirements of BLASTZ and TBA, while improving overall accuracy.

CONCLUSIONS/SIGNIFICANCE: Murasaki provides an open source platform to take advantage of long patterns, cluster computing, and novel hash algorithms to produce accurate anchors across multiple genomes with computational efficiency significantly greater than existing methods. Murasaki is available under GPL at http://murasaki.sourceforge.net.
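The abstract names spaced seeds (mismatch patterns) and hashing as the basis for finding anchors shared by every genome. The following is a minimal sketch of that general idea only, not Murasaki's implementation: it uses an ordinary Python dictionary in place of the adaptive hash functions described in the paper, and the names `spaced_seed_key`, `find_anchors`, and the pattern string format (`1` = position must match, `0` = mismatch allowed) are assumptions made for illustration.

```python
from collections import defaultdict

def spaced_seed_key(seq, start, pattern):
    """Return the bases at the '1' positions of `pattern`, or None if the
    window runs off the end or contains a non-ACGT character."""
    window = seq[start:start + len(pattern)]
    if len(window) < len(pattern):
        return None
    key = "".join(base for base, p in zip(window, pattern) if p == "1")
    return key if set(key) <= set("ACGT") else None

def find_anchors(seqs, pattern):
    """Index every window of every sequence by its spaced-seed key and keep
    only keys that occur in all sequences (candidate anchors).

    Returns {key: [positions in seqs[0], positions in seqs[1], ...]}.
    """
    # Hypothetical stand-in for a hash table: one position list per sequence.
    index = defaultdict(lambda: [[] for _ in seqs])
    for i, seq in enumerate(seqs):
        for start in range(len(seq) - len(pattern) + 1):
            key = spaced_seed_key(seq, start, pattern)
            if key is not None:
                index[key][i].append(start)
    # A key is a candidate anchor only if every sequence has at least one hit.
    return {k: pos for k, pos in index.items() if all(pos)}
```

For example, with the pattern `1101` the windows `GATT` and `GACT` share the key `GAT` because their third position is ignored, so `find_anchors(["GATTACA", "CGACTAG"], "1101")` returns `{"GAT": [[0], [1]]}`. A single pass over each sequence fills the index, which is the property that allows this style of matching to scale with total sequence length rather than with the number of sequence pairs.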

Similar articles

[Development of a large-scale comparative genome system and its application to the analysis of mycobacteria genomes].

Nihon Hansenbyo Gakkai Zasshi. 2007 Sep;76(3):251-6. doi: 10.5025/hansen.76.251.

G-Anchor: a novel approach for whole-genome comparative mapping utilizing evolutionary conserved DNA sequences.

Gigascience. 2018 May 1;7(5). doi: 10.1093/gigascience/giy017.

Accurate multiple alignment of distantly related genome sequences using filtered spaced word matches as anchor points.

Bioinformatics. 2019 Jan 15;35(2):211-218. doi: 10.1093/bioinformatics/bty592.

Cgaln: fast and space-efficient whole-genome alignment.

BMC Bioinformatics. 2010 Apr 30;11:224. doi: 10.1186/1471-2105-11-224.

Accurate anchoring alignment of divergent sequences.

Bioinformatics. 2006 Jan 1;22(1):29-34. doi: 10.1093/bioinformatics/bti772. Epub 2005 Nov 13.

GAME: a simple and efficient whole genome alignment method using maximal exact match filtering.

Comput Biol Chem. 2005 Jun;29(3):244-53. doi: 10.1016/j.compbiolchem.2005.04.004.

Mugsy: fast multiple alignment of closely related whole genomes.

Bioinformatics. 2011 Feb 1;27(3):334-42. doi: 10.1093/bioinformatics/btq665. Epub 2010 Dec 9.

SGA: a grammar-based alignment algorithm.

Comput Methods Programs Biomed. 2007 Apr;86(1):17-20. doi: 10.1016/j.cmpb.2006.12.007. Epub 2007 Jan 30.

Mulan: multiple-sequence local alignment and visualization for studying function and evolution.

Genome Res. 2005 Jan;15(1):184-94. doi: 10.1101/gr.3007205. Epub 2004 Dec 8.

Cited by

Comparative genomics of , a fast-growing pathogen of wild .

Microb Genom. 2023 Oct;9(10). doi: 10.1099/mgen.0.001112.

Hamster PIWI proteins bind to piRNAs with stage-specific size variations during oocyte maturation.

Nucleic Acids Res. 2021 Mar 18;49(5):2700-2720. doi: 10.1093/nar/gkab059.

Genomic Characteristics of the Toxic Bloom-Forming Cyanobacterium NIES-102.

J Genomics. 2020 Jan 1;8:1-6. doi: 10.7150/jgen.40978. eCollection 2020.

Inferring the Minimal Genome of by Comparative Genomics and Transposon Mutagenesis.

mSystems. 2018 Apr 10;3(3). doi: 10.1128/mSystems.00198-17. eCollection 2018 May-Jun.

Complete Genome Sequence of NIES-2481 and Common Genomic Features of Group G .

J Genomics. 2018 Mar 19;6:30-33. doi: 10.7150/jgen.24935. eCollection 2018.

Comparative genomics of the tardigrades Hypsibius dujardini and Ramazzottius varieornatus.

PLoS Biol. 2017 Jul 27;15(7):e2002266. doi: 10.1371/journal.pbio.2002266. eCollection 2017 Jul.

Whole-Genome Sequencing and Comparative Genome Analysis of Bacillus subtilis Strains Isolated from Non-Salted Fermented Soybean Foods.

PLoS One. 2015 Oct 27;10(10):e0141369. doi: 10.1371/journal.pone.0141369. eCollection 2015.

Comparison of the terrestrial cyanobacterium Leptolyngbya sp. NIES-2104 and the freshwater Leptolyngbya boryana PCC 6306 genomes.

DNA Res. 2015 Dec;22(6):403-12. doi: 10.1093/dnares/dsv022. Epub 2015 Oct 21.

Genome sequence and comparative analysis of a putative entomopathogenic Serratia isolated from Caenorhabditis briggsae.

BMC Genomics. 2015 Jul 18;16(1):531. doi: 10.1186/s12864-015-1697-8.

Complete Genome Sequence of Microcystis aeruginosa NIES-2549, a Bloom-Forming Cyanobacterium from Lake Kasumigaura, Japan.

Genome Announc. 2015 May 28;3(3):e00551-15. doi: 10.1128/genomeA.00551-15.
