Cancer Data Science Laboratory, National Cancer Institute, National Institutes of Health, Bethesda, MD, USA.
Department of Computer Science & Engineering, UC San Diego, La Jolla, CA, USA.
Nat Commun. 2022 Oct 28;13(1):6430. doi: 10.1038/s41467-022-33869-7.
Computational identification and quantification of distinct microbes from high-throughput sequencing data is crucial for our understanding of human health. Existing methods use either accurate but computationally expensive alignment-based approaches or fast but less accurate alignment-free approaches, which often fail to correctly assign reads to genomes. Here we introduce CAMMiQ, a combinatorial optimization framework to identify and quantify distinct genomes (specified by a database) in a metagenomic dataset. As a key methodological innovation, CAMMiQ uses substrings of variable length, as well as substrings that appear in two genomes in the database, as opposed to the commonly used fixed-length, unique substrings. These substrings make it possible to accurately decouple mixtures of highly similar genomes, resulting in higher accuracy than the leading alternatives without requiring additional computational resources, as demonstrated on commonly used benchmarking datasets. Importantly, we show that CAMMiQ can distinguish closely related bacterial strains in simulated metagenomic and real single-cell metatranscriptomic data.
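To make the distinction between fixed-length unique substrings and variable-length, possibly shared substrings concrete, the following is a minimal Python sketch on a toy database. It is not CAMMiQ's implementation: the function name `classify_substrings`, the length bounds, the brute-force enumeration, and the three toy genomes are all assumptions chosen for illustration. The sketch records which genomes contain each substring and keeps those found in exactly one genome ("unique") or exactly two genomes ("doubly unique").

```python
from collections import defaultdict


def classify_substrings(genomes, min_len, max_len):
    """Brute-force illustration (hypothetical helper, not CAMMiQ's algorithm).

    Returns two dicts:
      unique: substring -> id of the single genome containing it
      doubly: substring -> sorted pair of ids of the two genomes containing it
    """
    owners = defaultdict(set)  # substring -> set of genome ids containing it
    for gid, seq in genomes.items():
        for k in range(min_len, max_len + 1):
            for i in range(len(seq) - k + 1):
                owners[seq[i:i + k]].add(gid)

    unique = {s: next(iter(g)) for s, g in owners.items() if len(g) == 1}
    doubly = {s: tuple(sorted(g)) for s, g in owners.items() if len(g) == 2}
    return unique, doubly


# Toy database (made-up sequences): B shares its prefix with A and its suffix with C.
genomes = {"A": "ACGTACGA", "B": "ACGTTCGA", "C": "GGGTTCGA"}
unique, doubly = classify_substrings(genomes, min_len=3, max_len=5)

# Genome B has no unique substring of length 3, but it does have one of length 4.
b_unique_3 = [s for s, g in unique.items() if g == "B" and len(s) == 3]
b_unique_4 = [s for s, g in unique.items() if g == "B" and len(s) == 4]
```

In this toy database, a fixed length of 3 leaves genome B without any unique substring, while allowing length 4 yields one (`CGTT`). Substrings shared by exactly two genomes, such as `ACG` (A and B) or `GTT` (B and C), still narrow a read down to a pair of candidates, which is the kind of extra signal the abstract describes.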