Department of Computer Science and Engineering, The Pennsylvania State University, University Park, PA 16802, USA.
Huck Institutes of the Life Sciences, The Pennsylvania State University, University Park, PA 16802, USA.
Bioinformatics. 2023 Jun 30;39(39 Suppl 1):i232-i241. doi: 10.1093/bioinformatics/btad218.
Modern methods for computation-intensive tasks in sequence analysis (e.g. read mapping, sequence alignment, genome assembly, etc.) often first transform each sequence into a list of short, regular-length seeds so that compact data structures and efficient algorithms can be employed to handle the ever-growing large-scale data. Seeding methods using kmers (substrings of length k) have gained tremendous success in processing sequencing data with low mutation/error rates. However, they are much less effective for sequencing data with high error rates as kmers cannot tolerate errors.
We propose SubseqHash, a strategy that uses subsequences, rather than substrings, as seeds. Formally, SubseqHash maps a string of length n to its smallest subsequence of length k, k < n, according to a given order overall length-k strings. Finding the smallest subsequence of a string by enumeration is impractical as the number of subsequences grows exponentially. To overcome this barrier, we propose a novel algorithmic framework that consists of a specifically designed order (termed ABC order) and an algorithm that computes the minimized subsequence under an ABC order in polynomial time. We first show that the ABC order exhibits the desired property and the probability of hash collision using the ABC order is close to the Jaccard index. We then show that SubseqHash overwhelmingly outperforms the substring-based seeding methods in producing high-quality seed-matches for three critical applications: read mapping, sequence alignment, and overlap detection. SubseqHash presents a major algorithmic breakthrough for tackling the high error rates and we expect it to be widely adapted for long-reads analysis.
SubseqHash is freely available at https://github.com/Shao-Group/subseqhash.
现代计算密集型序列分析任务的方法(例如读映射、序列比对、基因组组装等)通常首先将每个序列转换为短的、规则长度的种子列表,以便使用紧凑的数据结构和高效的算法来处理不断增长的大规模数据。使用 kmers(长度为 k 的子字符串)的种子方法在处理低突变/错误率的测序数据方面取得了巨大的成功。然而,对于高错误率的测序数据,它们的效果要差得多,因为 kmers 无法容忍错误。
我们提出了 SubseqHash,这是一种使用子序列而不是子字符串作为种子的策略。形式上,SubseqHash 根据给定的所有长度为 k 的字符串的顺序,将长度为 n 的字符串映射到其长度为 k(k<n)的最小子序列。通过枚举找到字符串的最小子序列是不切实际的,因为子序列的数量呈指数增长。为了克服这个障碍,我们提出了一个新的算法框架,该框架由一个专门设计的顺序(称为 ABC 顺序)和一个在多项式时间内根据 ABC 顺序计算最小子序列的算法组成。我们首先证明 ABC 顺序具有所需的性质,并且使用 ABC 顺序的哈希碰撞概率接近 Jaccard 指数。然后,我们证明 SubseqHash 在产生三个关键应用程序(读映射、序列比对和重叠检测)的高质量种子匹配方面,大大优于基于子字符串的种子方法。SubseqHash 在处理高错误率方面取得了重大算法突破,我们预计它将被广泛应用于长读分析。
SubseqHash 可在 https://github.com/Shao-Group/subseqhash 上免费获得。