Internet Technology and Data Science Lab, Ghent University, Ghent, Belgium.
Center for Bioinformatics Saar, Saarland University, Saarbrücken, Germany.
J Comput Biol. 2024 Oct;31(10):975-989. doi: 10.1089/cmb.2024.0664. Epub 2024 Sep 30.
This study introduces a pioneering approach to automate the creation of search schemes for lossless approximate pattern matching. Search schemes are combinatorial structures that define a series of searches over a partitioned pattern. Each search specifies the processing order of these parts and the cumulative lower and upper bounds on the number of errors in each part of the pattern. Together, these searches ensure the identification of all approximate occurrences of a search pattern within a predefined limit of errors. While existing literature offers designed schemes for up to = 4 errors, designing search schemes for larger values incurs escalating computational costs. Our method integrates a greedy algorithm and a novel Integer Linear Programming (ILP) formulation to design efficient search schemes for up to = 7 errors. Comparative analyses demonstrate the superiority of our ILP-optimal schemes over alternative strategies in both theoretical and practical contexts. Additionally, we propose a dynamic scheme selection technique tailored to specific search patterns, further enhancing efficiency. Combined, this yields runtime reductions of up to 53% for higher values. To facilitate search scheme generation, we present Hato, an open-source software tool (AGPL-3.0 license) employing the greedy algorithm and utilizing CPLEX for ILP solving. Furthermore, we introduce Columba 1.2, an open-source lossless read-mapper (AGPL-3.0 license) implemented in C++. Columba surpasses existing state-of-the-art tools by identifying all approximate occurrences of 100,000 Illumina reads (150 bp) in the human reference genome within 24 seconds (maximum edit distance of 4) and 75 seconds (maximum edit distance of 6) using a single CPU core. Notably, our study showcases Columba's capability to align 100,000 reads of length 50, with high error rates and up to an edit distance of 7, in a mere 2 hours and 15 minutes. This achievement is unmatched by other lossless aligners, which require over 3 hours for edit distance 5 alignments. Moreover, Columba exhibits a mapping rate four times higher than that of a lossy tool for this dataset.
这项研究介绍了一种开创性的方法,用于自动创建无损近似模式匹配的搜索方案。搜索方案是组合结构,定义了对分区模式的一系列搜索。每个搜索指定了这些部分的处理顺序,以及模式中每个部分的错误数量的累积下限和上限。这些搜索一起确保在预定义的错误限制内识别搜索模式的所有近似出现。虽然现有文献提供了最多 = 4 个错误的设计方案,但设计 值较大的搜索方案会导致计算成本的增加。我们的方法集成了贪婪算法和新颖的整数线性规划(ILP)公式,用于设计最多可达 = 7 个错误的高效搜索方案。比较分析表明,我们的 ILP 最优方案在理论和实际上下文中都优于替代策略。此外,我们提出了一种针对特定搜索模式的动态方案选择技术,进一步提高了效率。结合使用,这可以为更高的 值减少多达 53%的运行时间。为了促进搜索方案的生成,我们提出了 Hato,这是一个采用贪婪算法并使用 CPLEX 解决 ILP 的开源软件工具(AGPL-3.0 许可证)。此外,我们引入了 Columba 1.2,这是一个用 C++实现的开源无损读映射器(AGPL-3.0 许可证)。Columba 通过在人类参考基因组中在 24 秒(最大编辑距离为 4)和 75 秒(最大编辑距离为 6)内识别 100,000 个 Illumina 读取(150 bp)的所有近似出现,超过了现有的最先进工具。值得注意的是,我们的研究展示了 Columba 在仅仅 2 小时 15 分钟内对齐具有高错误率和高达 7 的编辑距离的 100,000 个长度为 50 的读取的能力,这是其他无损对齐器无法实现的。此外,Columba 对于这个数据集的映射率比有损工具高四倍。