Boucher Christina, Gagie Travis, Tomohiro I, Köppl Dominik, Langmead Ben, Manzini Giovanni, Navarro Gonzalo, Pacheco Alejandro, Rossi Massimiliano
U Florida Gainesville, USA.
Dalhousie U Halifax, Canada.
Proc Data Compress Conf. 2021 Mar;2021:193-202. doi: 10.1109/dcc50243.2021.00027. Epub 2021 May 10.
Computing the matching statistics of patterns with respect to a text is a fundamental task in bioinformatics, but a formidable one when the text is a highly compressed genomic database. Bannai et al. gave an efficient solution for this case, which Rossi et al. recently implemented, but it uses two passes over the patterns and buffers a pointer for each character during the first pass. In this paper, we simplify their solution and make it streaming, at the cost of slowing it down slightly. This means that, first, we can compute the matching statistics of several long patterns (such as whole human chromosomes) in parallel while still using a reasonable amount of RAM; second, we can compute matching statistics online with low latency and thus quickly recognize when a pattern becomes incompressible relative to the database. Our code is available at https://github.com/koeppl/phoni.
计算模式相对于文本的匹配统计信息是生物信息学中的一项基本任务,但当文本是高度压缩的基因组数据库时,这是一项艰巨的任务。Bannai等人针对这种情况给出了一个有效的解决方案,Rossi等人最近对其进行了实现,但它对模式进行了两遍处理,并且在第一遍处理期间为每个字符缓冲一个指针。在本文中,我们简化了他们的解决方案并使其成为流处理方式,代价是稍微降低了处理速度。这意味着,首先,我们可以并行计算多个长模式(例如整个人类染色体)的匹配统计信息,同时仍然使用合理数量的随机存取存储器(RAM);其次,我们可以在线以低延迟计算匹配统计信息,从而快速识别出相对于数据库而言模式何时变得不可压缩。我们的代码可在https://github.com/koeppl/phoni上获取。