CONSULT: accurate contamination removal using locality-sensitive hashing.

Author information

Rachtman Eleonora, Bafna Vineet, Mirarab Siavash

Affiliations

Bioinformatics and Systems Biology Graduate Program, UC San Diego, CA 92093, USA.

Department of Computer Science and Engineering, UC San Diego, CA 92093, USA.

Publication information

NAR Genom Bioinform. 2021 Aug 5;3(3):lqab071. doi: 10.1093/nargab/lqab071. eCollection 2021 Sep.

DOI:10.1093/nargab/lqab071

PMID:34377979

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC8340999/

Abstract

A fundamental question appears in many bioinformatics applications: Does a sequencing read belong to a large dataset of genomes from some broad taxonomic group, even when the closest match in the set is evolutionarily divergent from the query? For example, low-coverage genome sequencing (skimming) projects either assemble the organelle genome or compute genomic distances directly from unassembled reads. Using unassembled reads needs contamination detection because samples often include reads from unintended groups of species. Similarly, assembling the organelle genome needs distinguishing organelle and nuclear reads. While k-mer-based methods have shown promise in read-matching, prior studies have shown that existing methods are insufficiently sensitive for contamination detection. Here, we introduce a new read-matching tool called CONSULT that tests whether k-mers from a query fall within a user-specified distance of the reference dataset using locality-sensitive hashing. Taking advantage of large memory machines available nowadays, CONSULT libraries accommodate tens of thousands of microbial species. Our results show that CONSULT has higher true-positive and lower false-positive rates of contamination detection than leading methods such as Kraken-II and improves distance calculation from genome skims. We also demonstrate that CONSULT can distinguish organelle reads from nuclear reads, leading to dramatic improvements in skim-based mitochondrial assemblies.
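
The core idea in the abstract, testing whether a query k-mer lies within a given distance of any reference k-mer via locality-sensitive hashing, can be illustrated with a toy sketch. The abstract does not name the distance metric or the hash family, so this example assumes Hamming distance and a simple position-sampling LSH scheme. The names `HammingLSHIndex`, `has_neighbor`, and `read_matches`, and all parameter values, are hypothetical and are not taken from the CONSULT software.

```python
import random


def hamming(a: str, b: str) -> int:
    """Count mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def kmers(seq: str, k: int):
    """Yield every overlapping k-mer of seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


class HammingLSHIndex:
    """Toy position-sampling LSH index over k-mers (illustrative only)."""

    def __init__(self, k: int, num_tables: int, positions_per_table: int, seed: int = 0):
        rng = random.Random(seed)
        self.k = k
        # Each table keys a k-mer by its characters at a random subset of positions.
        self.masks = [sorted(rng.sample(range(k), positions_per_table))
                      for _ in range(num_tables)]
        self.tables = [dict() for _ in range(num_tables)]

    @staticmethod
    def _key(kmer: str, mask) -> str:
        return "".join(kmer[i] for i in mask)

    def add(self, kmer: str) -> None:
        """Store a reference k-mer in every table."""
        for mask, table in zip(self.masks, self.tables):
            table.setdefault(self._key(kmer, mask), set()).add(kmer)

    def has_neighbor(self, kmer: str, max_dist: int) -> bool:
        """True if a colliding reference k-mer is within max_dist mismatches.

        Candidates are verified explicitly, so there are no false positives,
        but a true neighbor that collides in no table is missed.
        """
        for mask, table in zip(self.masks, self.tables):
            for cand in table.get(self._key(kmer, mask), ()):
                if hamming(kmer, cand) <= max_dist:
                    return True
        return False


def read_matches(read: str, index: HammingLSHIndex, max_dist: int, min_hits: int = 1) -> bool:
    """Call a read a match if at least min_hits of its k-mers have a close reference k-mer."""
    hits = sum(index.has_neighbor(km, max_dist) for km in kmers(read, index.k))
    return hits >= min_hits


if __name__ == "__main__":
    index = HammingLSHIndex(k=8, num_tables=4, positions_per_table=4, seed=0)
    for km in kmers("ACGTACGTTTGACCA", 8):
        index.add(km)
    print(read_matches("ACGTACGTTTGACCA", index, max_dist=1))
```

Using more tables raises the chance that a true neighbor collides with the query in at least one table, at the cost of memory; sampling more positions per table makes each lookup more selective but less tolerant of mismatches.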

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/a049/8340999/ba530096348e/lqab071fig1.jpg

Similar articles

CONSULT: accurate contamination removal using locality-sensitive hashing.

NAR Genom Bioinform. 2021 Aug 5;3(3):lqab071. doi: 10.1093/nargab/lqab071. eCollection 2021 Sep.

The impact of contaminants on the accuracy of genome skimming and the effectiveness of exclusion read filters.

Mol Ecol Resour. 2020 May;20(3). doi: 10.1111/1755-0998.13135. Epub 2020 Feb 4.

CONSULT-II: accurate taxonomic identification and profiling using locality-sensitive hashing.

Bioinformatics. 2024 Mar 29;40(4). doi: 10.1093/bioinformatics/btae150.

Memory-bound k-mer selection for large and evolutionary diverse reference libraries.

bioRxiv. 2024 Jul 10:2024.02.12.580015. doi: 10.1101/2024.02.12.580015.

Memory-bound k-mer selection for large and evolutionarily diverse reference libraries.

Genome Res. 2024 Oct 11;34(9):1455-1467. doi: 10.1101/gr.279339.124.

Skmer: assembly-free and alignment-free sample identification using genome skims.

Genome Biol. 2019 Feb 13;20(1):34. doi: 10.1186/s13059-019-1632-4.

Testing Efficacy of Assembly-Free and Alignment-Free Methods for Species Identification Using Genome Skims, with Patellogastropoda as a Test Case.

Genes (Basel). 2022 Jul 2;13(7):1192. doi: 10.3390/genes13071192.

Improving the sensitivity of long read overlap detection using grouped short k-mer matches.

BMC Genomics. 2019 Apr 4;20(Suppl 2):190. doi: 10.1186/s12864-019-5475-x.

Tracembler--software for in-silico chromosome walking in unassembled genomes.

BMC Bioinformatics. 2007 May 9;8:151. doi: 10.1186/1471-2105-8-151.

Assembling large genomes with single-molecule sequencing and locality-sensitive hashing.

Nat Biotechnol. 2015 Jun;33(6):623-30. doi: 10.1038/nbt.3238. Epub 2015 May 25.

Cited by

Memory-bound k-mer selection for large and evolutionarily diverse reference libraries.

Genome Res. 2024 Oct 11;34(9):1455-1467. doi: 10.1101/gr.279339.124.

A nuclear genome assembly of an extinct flightless bird, the little bush moa.

Sci Adv. 2024 May 24;10(21):eadj6823. doi: 10.1126/sciadv.adj6823. Epub 2024 May 23.

Ten common issues with reference sequence databases and how to mitigate them.

Front Bioinform. 2024 Mar 15;4:1278228. doi: 10.3389/fbinf.2024.1278228. eCollection 2024.

CONSULT-II: accurate taxonomic identification and profiling using locality-sensitive hashing.

Bioinformatics. 2024 Mar 29;40(4). doi: 10.1093/bioinformatics/btae150.

ContScout: sensitive detection and removal of contamination from annotated genomes.

Nat Commun. 2024 Jan 31;15(1):936. doi: 10.1038/s41467-024-45024-5.

GTax: improving de novo transcriptome assembly by removing foreign RNA contamination.

Genome Biol. 2024 Jan 8;25(1):12. doi: 10.1186/s13059-023-03141-2.

Creating and Using Minimizer Sketches in Computational Genomics.

J Comput Biol. 2023 Dec;30(12):1251-1276. doi: 10.1089/cmb.2023.0094. Epub 2023 Aug 30.

Quantifying the uncertainty of assembly-free genome-wide distance estimates and phylogenetic relationships using subsampling.

Cell Syst. 2022 Oct 19;13(10):817-829.e3. doi: 10.1016/j.cels.2022.06.007.

Genomic Analysis of Mycobacterium abscessus Complex Isolates from Patients with Pulmonary Infection in China.

Microbiol Spectr. 2022 Aug 31;10(4):e0011822. doi: 10.1128/spectrum.00118-22. Epub 2022 Jul 12.

Contamination detection in genomic data: more is not enough.

Genome Biol. 2022 Feb 21;23(1):60. doi: 10.1186/s13059-022-02619-9.
