Rachtman Eleonora, Bafna Vineet, Mirarab Siavash
Bioinformatics and Systems Biology Graduate Program, UC San Diego, CA 92093, USA.
Department of Computer Science and Engineering, UC San Diego, CA 92093, USA.
NAR Genom Bioinform. 2021 Aug 5;3(3):lqab071. doi: 10.1093/nargab/lqab071. eCollection 2021 Sep.
A fundamental question arises in many bioinformatics applications: does a sequencing read belong to a large dataset of genomes from some broad taxonomic group, even when the closest match in the set is evolutionarily divergent from the query? For example, low-coverage genome sequencing (skimming) projects either assemble the organelle genome or compute genomic distances directly from unassembled reads. Using unassembled reads requires contamination detection because samples often include reads from unintended groups of species. Similarly, assembling the organelle genome requires distinguishing organelle reads from nuclear reads. While k-mer-based methods have shown promise in read matching, prior studies have shown that existing methods are insufficiently sensitive for contamination detection. Here, we introduce a new read-matching tool called CONSULT that uses locality-sensitive hashing to test whether k-mers from a query fall within a user-specified distance of the reference dataset. Taking advantage of the large-memory machines now available, CONSULT libraries accommodate tens of thousands of microbial species. Our results show that CONSULT has higher true-positive and lower false-positive rates of contamination detection than leading methods such as Kraken-II and improves distance calculation from genome skims. We also demonstrate that CONSULT can distinguish organelle reads from nuclear reads, leading to dramatic improvements in skim-based mitochondrial assemblies.
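The idea of testing whether a query k-mer lies within a set distance of any reference k-mer can be illustrated with a small sketch. This is not CONSULT's implementation: it assumes Hamming distance as the distance measure and uses bit-sampling, one common locality-sensitive hashing scheme for Hamming distance. The names `KmerLSHIndex`, `num_tables`, `positions_per_table`, `max_dist`, and `min_hits` are hypothetical and chosen only for illustration.

```python
import random


def kmers(seq, k):
    """Yield every length-k substring of seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


class KmerLSHIndex:
    """Toy bit-sampling LSH index over reference k-mers (illustrative only)."""

    def __init__(self, k, num_tables, positions_per_table, seed=0):
        rng = random.Random(seed)
        self.k = k
        # Each table hashes a k-mer by a fixed random subset of its positions.
        self.masks = [
            sorted(rng.sample(range(k), positions_per_table))
            for _ in range(num_tables)
        ]
        self.tables = [{} for _ in self.masks]

    @staticmethod
    def _key(kmer, mask):
        return "".join(kmer[i] for i in mask)

    def add_reference(self, seq):
        """Insert every k-mer of a reference sequence into every table."""
        for kmer in kmers(seq, self.k):
            for mask, table in zip(self.masks, self.tables):
                table.setdefault(self._key(kmer, mask), set()).add(kmer)

    def matches(self, kmer, max_dist):
        """True if some colliding reference k-mer is within max_dist."""
        for mask, table in zip(self.masks, self.tables):
            for candidate in table.get(self._key(kmer, mask), ()):
                # Verify each candidate, so a collision alone never counts.
                if hamming(kmer, candidate) <= max_dist:
                    return True
        return False

    def read_matches(self, read, max_dist, min_hits=1):
        """Call a read a match once min_hits of its k-mers match."""
        hits = 0
        for kmer in kmers(read, self.k):
            if self.matches(kmer, max_dist):
                hits += 1
                if hits >= min_hits:
                    return True
        return False


index = KmerLSHIndex(k=8, num_tables=4, positions_per_table=4, seed=1)
index.add_reference("ACGTACGTTGCA")
print(index.read_matches("GGACGTACGTGG", max_dist=0))  # read contains ACGTACGT
```

In this sketch an exact copy of a reference k-mer always collides in every table, and a reported match is always verified against the distance threshold. A near neighbor can still be missed if it differs from the reference at a sampled position in every table; adding tables or sampling fewer positions trades memory and lookup time for sensitivity. Reverse complements and compact k-mer encodings are left out for brevity.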