Department of Biosciences and Informatics, Keio University, Yokohama, Japan.
PLoS One. 2010 Sep 24;5(9):e12651. doi: 10.1371/journal.pone.0012651.
With the number of available genome sequences increasing rapidly, the magnitude of sequence data required for multiple-genome analyses is a challenging problem. When large-scale rearrangements break the collinearity of gene orders among genomes, genome comparison algorithms must first identify sets of short well-conserved sequences present in each genome, termed anchors. Previously, anchor identification among multiple genomes has been achieved using pairwise alignment tools like BLASTZ through progressive alignment tools like TBA, but the computational requirements for sequence comparisons of multiple genomes quickly becomes a limiting factor as the number and scale of genomes grows.
METHODOLOGY/PRINCIPAL FINDINGS: Our algorithm, named Murasaki, makes it possible to identify anchors within multiple large sequences on the scale of several hundred megabases in few minutes using a single CPU. Two advanced features of Murasaki are (1) adaptive hash function generation, which enables efficient use of arbitrary mismatch patterns (spaced seeds) and therefore the comparison of multiple mammalian genomes in a practical amount of computation time, and (2) parallelizable execution that decreases the required wall-clock and CPU times. Murasaki can perform a sensitive anchoring of eight mammalian genomes (human, chimp, rhesus, orangutan, mouse, rat, dog, and cow) in 21 hours CPU time (42 minutes wall time). This is the first single-pass in-core anchoring of multiple mammalian genomes. We evaluated Murasaki by comparing it with the genome alignment programs BLASTZ and TBA. We show that Murasaki can anchor multiple genomes in near linear time, compared to the quadratic time requirements of BLASTZ and TBA, while improving overall accuracy.
CONCLUSIONS/SIGNIFICANCE: Murasaki provides an open source platform to take advantage of long patterns, cluster computing, and novel hash algorithms to produce accurate anchors across multiple genomes with computational efficiency significantly greater than existing methods. Murasaki is available under GPL at http://murasaki.sourceforge.net.
随着可用基因组序列数量的快速增加,多基因组分析所需的序列数据量是一个具有挑战性的问题。当大规模重排打破基因组中基因顺序的共线性时,基因组比较算法必须首先确定存在于每个基因组中的短且保守序列集,称为锚点。以前,通过渐进比对工具(如 TBA)对多个基因组进行比对,使用 BLASTZ 等两两比对工具来实现多个基因组之间的锚点识别,但随着基因组数量和规模的增加,多个基因组的序列比对的计算需求很快成为一个限制因素。
方法/主要发现:我们的算法名为 Murasaki,它可以在几分钟内使用单个 CPU 对数百兆碱基规模的多个大型序列中的锚点进行识别。Murasaki 的两个高级功能是:(1)自适应散列函数生成,它可以有效地使用任意不匹配模式(间隔种子),从而在实际计算时间内比较多个哺乳动物基因组;(2)可并行化执行,减少所需的wall-clock 和 CPU 时间。Murasaki 可以在 21 小时的 CPU 时间(42 分钟的 wall 时间)内完成 8 个哺乳动物基因组(人类、黑猩猩、恒河猴、猩猩、小鼠、大鼠、狗和牛)的敏感锚定。这是首次对多个哺乳动物基因组进行单遍内核对齐。我们通过将 Murasaki 与基因组比对程序 BLASTZ 和 TBA 进行比较来评估它。我们表明,与 BLASTZ 和 TBA 的二次时间要求相比,Murasaki 可以在线性时间内对多个基因组进行锚定,同时提高整体准确性。
结论/意义:Murasaki 提供了一个开源平台,可以利用长模式、集群计算和新的散列算法,在计算效率上大大优于现有方法,在多个基因组中生成准确的锚点。Murasaki 可在 GPL 下从 http://murasaki.sourceforge.net 获得。