Flomin Dan, Pellow David, Shamir Ron
Blavatnik School of Computer Science, Tel-Aviv University, Tel-Aviv, Israel.
J Comput Biol. 2022 Aug;29(8):825-838. doi: 10.1089/cmb.2021.0599. Epub 2022 May 6.
The rapid, continuous growth of deep sequencing experiments requires the development and improvement of many bioinformatic applications for analyzing large sequencing data sets, including k-mer counting and assembly. Several applications reduce memory usage by binning sequences. Binning is done with minimizer schemes, which rely on a specific order of the minimizers. It has been demonstrated that the choice of order has a major impact on the performance of these applications. Here we introduce a method for tailoring the order to the data set. Our method repeatedly samples the data set and modifies the order so as to flatten the k-mer load distribution across minimizers. We integrated our method into Gerbil, a state-of-the-art memory-efficient k-mer counter, and were able to reduce its memory footprint by 30%-50% for large k, with only a minor increase in runtime. Our tests also showed that the orders generated by our method gave superior results when transferred across data sets from the same species, with little or no change to the order. This enables memory reduction with essentially no increase in runtime.
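To make the idea of flattening the k-mer load across minimizers concrete, the following is a minimal sketch, not the authors' algorithm and not Gerbil's implementation. It assumes a lexicographic base order on m-mers, and it uses a hypothetical `penalty` table to push selected m-mers later in the order. The function names, the toy read, and the single "demote the heaviest minimizer" step are all illustrative assumptions; the method described in the abstract samples the data set repeatedly and is not specified here in enough detail to reproduce.

```python
from collections import Counter


def kmers(seq, k):
    """Yield every length-k substring of seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def minimizer(kmer, m, penalty):
    """Return the m-mer of kmer that comes first in the order.

    Assumed order: lower penalty first, then lexicographic.
    m-mers missing from `penalty` have penalty 0.
    """
    mmers = [kmer[i:i + m] for i in range(len(kmer) - m + 1)]
    return min(mmers, key=lambda s: (penalty.get(s, 0), s))


def load_per_minimizer(reads, k, m, penalty):
    """Count how many k-mers fall into each minimizer's bin."""
    load = Counter()
    for read in reads:
        for km in kmers(read, k):
            load[minimizer(km, m, penalty)] += 1
    return load


def demote_heaviest(load, penalty):
    """Return a new penalty table with the most loaded minimizer pushed back."""
    heaviest, _ = load.most_common(1)[0]
    updated = dict(penalty)
    updated[heaviest] = updated.get(heaviest, 0) + 1
    return updated


# Toy example (hypothetical data): one read, k = 4, m = 2.
reads = ["AACAAGAAT"]
before = load_per_minimizer(reads, 4, 2, {})
penalty = demote_heaviest(before, {})
after = load_per_minimizer(reads, 4, 2, penalty)
print(max(before.values()))  # 6: every k-mer lands in the "AA" bin
print(max(after.values()))   # 3: the load is spread over "AC", "AG", "AT"
```

In this toy case, a single demotion halves the largest bin. In a k-mer counter that bins by minimizer, the largest bin bounds how much must be held in memory at once, which is why a flatter load distribution can reduce the memory footprint.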