Department of Computer Science, University of Helsinki, Helsinki, Finland.
Faculty of Computer Science, Dalhousie University, Halifax, Canada.
Bioinformatics. 2022 Jun 24;38(Suppl 1):i177-i184. doi: 10.1093/bioinformatics/btac226.
Bait enrichment is an increasingly common protocol that has been shown to successfully amplify regions of interest in metagenomic samples. In this method, a set of synthetic probes ('baits') is designed, manufactured and applied to fragmented metagenomic DNA. The probes bind to the fragmented DNA, and any unbound DNA is rinsed away, leaving the bound fragments to be amplified for sequencing. Metsky et al. demonstrated that bait enrichment is capable of detecting a large number of human viral pathogens within metagenomic samples.
We formalize the problem of designing baits by defining the Minimum Bait Cover problem, show that the problem is NP-hard even under very restrictive assumptions, and design an efficient heuristic that takes advantage of succinct data structures. We refer to our method as Syotti. The running time of Syotti scales linearly in practice, and Syotti runs at least an order of magnitude faster than state-of-the-art methods, including the method of Metsky et al. At the same time, our method produces bait sets that are smaller than the ones produced by the competing methods, while also leaving fewer positions uncovered. Lastly, we show that Syotti requires only 25 min to design baits for a dataset comprising 3 billion nucleotides from 1000 related bacterial substrains, whereas the method of Metsky et al. shows clearly super-linear running time and fails to process even a 17% subset of the data in 72 h.
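To make the covering objective concrete, the following is a minimal sketch of one simple greedy heuristic for a bait cover. It is an illustration under stated assumptions, not Syotti's implementation: it assumes that a bait of length `bait_len` covers every window of the same length that differs from it in at most `max_mismatches` positions (Hamming distance), it ignores reverse complements, and it uses a brute-force scan where the paper uses succinct data structures. The names `hamming`, `greedy_bait_cover`, `bait_len` and `max_mismatches` are hypothetical.

```python
def hamming(a: str, b: str) -> int:
    """Count mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def greedy_bait_cover(seqs: list[str], bait_len: int, max_mismatches: int) -> list[str]:
    """Greedily pick baits until every position of every sequence is covered.

    Assumption (not taken from the paper): a bait covers a window of
    length bait_len if their Hamming distance is at most max_mismatches.
    """
    if any(len(s) < bait_len for s in seqs):
        raise ValueError("every sequence must be at least bait_len long")

    covered = [[False] * len(s) for s in seqs]
    baits: list[str] = []

    for i, s in enumerate(seqs):
        for p in range(len(s)):
            if covered[i][p]:
                continue
            # Start a new bait at the first uncovered position, shifted
            # left if necessary so that it fits inside the sequence.
            start = min(p, len(s) - bait_len)
            bait = s[start:start + bait_len]
            baits.append(bait)
            # Brute-force scan: mark every window in every sequence
            # that the new bait matches within the mismatch budget.
            for j, t in enumerate(seqs):
                for q in range(len(t) - bait_len + 1):
                    if hamming(t[q:q + bait_len], bait) <= max_mismatches:
                        for k in range(q, q + bait_len):
                            covered[j][k] = True
    return baits


if __name__ == "__main__":
    example = ["ACGTACGTAC", "ACGTACGAAC"]
    print(greedy_bait_cover(example, bait_len=4, max_mismatches=1))
```

On the two toy sequences above, two baits suffice because the second sequence differs from the first by a single substitution, which falls within the mismatch budget. The brute-force scan is quadratic in the worst case; it is meant only to illustrate the problem definition, not the linear scaling reported for Syotti.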
https://github.com/jnalanko/syotti.
Supplementary data are available at Bioinformatics online.