Coarfa Cristian, Milosavljevic Aleksandar
Human Genome Sequencing Center, Department of Molecular and Human Genetics, Baylor College of Medicine, One Baylor Plaza, Houston, Texas 77030, USA.
Pac Symp Biocomput. 2008:102-13.
Many applications of next-generation sequencing technologies involve anchoring a sequence fragment or tag onto its corresponding position in a reference genome assembly. The Positional Hashing method, implemented in the Pash 2.0 program, is designed specifically for high-volume anchoring. In this article we present multi-diagonal gapped k-mer collation and other changes introduced in Pash 2.0 that further improve the accuracy and speed of Positional Hashing. Our goal is to show that gapped k-mer matching with cross-diagonal collation suffices for anchoring across close evolutionary distances and for human resequencing. We propose a benchmark for evaluating anchoring programs that captures the key parameters of specific applications, including the duplicative structure of the genomes of humans and other species. In large-scale anchoring experiments, Pash 2.0 achieves speedups of up to tenfold over BLAT, another similarity search program frequently used for anchoring.