Evaluation of methods for detecting human reads in microbial sequencing datasets.

Affiliations

Nuffield Department of Medicine, University of Oxford, Oxford, UK.

Organisms and Environment Division, School of Biosciences, Cardiff University, Cardiff, Wales, UK.

Publication information

Microb Genom. 2020 Jul;6(7). doi: 10.1099/mgen.0.000393.

DOI:10.1099/mgen.0.000393

PMID:32558637

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC7478626/

Abstract

Sequencing data from host-associated microbes can often be contaminated by the body of the investigator or research subject. Human DNA is typically removed from microbial reads either by subtractive alignment (dropping all reads that map to the human genome) or by using a read classification tool to predict those of human origin, and then discarding them. To inform best practice guidelines, we benchmarked eight alignment-based and two classification-based methods of human read detection using simulated data from 10 clinically prevalent bacteria and three viruses, into which contaminating human reads had been added. While the majority of methods successfully detected >99 % of the human reads, they were distinguishable by variance. The most precise methods, with negligible variance, were Bowtie2 and SNAP, both of which misidentified few, if any, bacterial reads (and no viral reads) as human. While correctly detecting a similar number of human reads, methods based on taxonomic classification, such as Kraken2 and Centrifuge, could misclassify bacterial reads as human, although the extent of this was species-specific. Among the most sensitive methods of human read detection was BWA, although this also made the greatest number of false positive classifications. Across all methods, the set of human reads not identified as such, although often representing <0.1 % of the total reads, were non-randomly distributed along the human genome with many originating from the repeat-rich sex chromosomes. For viral reads and longer (>300 bp) bacterial reads, the highest performing approaches were classification-based, using Kraken2 or Centrifuge. For shorter (c. 150 bp) bacterial reads, combining multiple methods of human read detection maximized the recovery of human reads from contaminated short read datasets without being compromised by false positives. 
A particularly high-performance approach with shorter bacterial reads was a two-stage classification using Bowtie2 followed by SNAP. Using this approach, we re-examined 11 577 publicly archived bacterial read sets for hitherto undetected human contamination. We were able to extract a sufficient number of reads to call known human SNPs, including those with clinical significance, in 6 % of the samples. These results show that phenotypically distinct human sequence is detectable in publicly archived microbial read datasets.
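
The two-stage idea and the sensitivity/precision comparison described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function names, the toy read IDs, and the two "stage" callables are hypothetical stand-ins for real tools such as Bowtie2 and SNAP, and the sketch assumes that the second stage is applied only to reads the first stage did not flag as human.

```python
from typing import Callable, Dict, Iterable, Set


def two_stage_human_ids(
    read_ids: Iterable[str],
    stage_one: Callable[[str], bool],
    stage_two: Callable[[str], bool],
) -> Set[str]:
    """Flag reads as human with stage one, then re-test the remainder with stage two.

    Assumption: each stage is a callable that returns True when it considers
    a read to be of human origin (a placeholder for an aligner or classifier).
    """
    read_ids = list(read_ids)
    flagged_first = {r for r in read_ids if stage_one(r)}
    # Only reads that survived stage one are passed to stage two.
    remaining = [r for r in read_ids if r not in flagged_first]
    flagged_second = {r for r in remaining if stage_two(r)}
    return flagged_first | flagged_second


def detection_metrics(truth: Dict[str, str], predicted_human: Set[str]) -> Dict[str, float]:
    """Compare predicted human reads with the known origin of simulated reads."""
    tp = sum(1 for r, src in truth.items() if src == "human" and r in predicted_human)
    fn = sum(1 for r, src in truth.items() if src == "human" and r not in predicted_human)
    fp = sum(1 for r, src in truth.items() if src != "human" and r in predicted_human)
    return {
        "tp": tp,
        "fp": fp,
        "fn": fn,
        # Sensitivity: fraction of true human reads that were detected.
        "sensitivity": tp / (tp + fn) if (tp + fn) else 0.0,
        # Precision: fraction of reads called human that really are human.
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
    }


# Toy ground truth for five simulated reads (hypothetical IDs).
truth = {"h1": "human", "h2": "human", "h3": "human", "b1": "bacterial", "b2": "bacterial"}

# Hypothetical calls from two methods, represented as sets of read IDs.
calls_stage_one = {"h1", "h2"}
calls_stage_two = {"h3", "b1"}

combined = two_stage_human_ids(
    truth, calls_stage_one.__contains__, calls_stage_two.__contains__
)
metrics = detection_metrics(truth, combined)
print(sorted(combined), metrics)
```

In this toy case the second stage recovers the human read missed by the first, at the cost of one bacterial false positive, which is the trade-off between sensitivity and precision that the benchmark quantifies.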

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/383d/7478626/df9f863ca25c/mgen-6-393-g001.jpg

Similar articles

Benchmarking metagenomics classifiers on ancient viral DNA: a simulation study.

PeerJ. 2022 Mar 24;10:e12784. doi: 10.7717/peerj.12784. eCollection 2022.

Read trimming has minimal effect on bacterial SNP-calling accuracy.

Microb Genom. 2020 Dec;6(12). doi: 10.1099/mgen.0.000434. Epub 2020 Dec 11.

Evaluation of taxonomic classification and profiling methods for long-read shotgun metagenomic sequencing datasets.

BMC Bioinformatics. 2022 Dec 13;23(1):541. doi: 10.1186/s12859-022-05103-0.

Evaluation of tools for taxonomic classification of viruses.

Brief Funct Genomics. 2023 Jan 20;22(1):31-41. doi: 10.1093/bfgp/elac036.

Evaluation and assessment of read-mapping by multiple next-generation sequencing aligners based on genome-wide characteristics.

Genomics. 2017 Jul;109(3-4):186-191. doi: 10.1016/j.ygeno.2017.03.001. Epub 2017 Mar 9.

Comparative study of sequence aligners for detecting antibiotic resistance in bacterial metagenomes.

Lett Appl Microbiol. 2018 Mar;66(3):162-168. doi: 10.1111/lam.12842. Epub 2018 Feb 1.

Systematic benchmark of ancient DNA read mapping.

Brief Bioinform. 2021 Sep 2;22(5). doi: 10.1093/bib/bbab076.

MinION™ nanopore sequencing of environmental metagenomes: a synthetic approach.

Gigascience. 2017 Mar 1;6(3):1-10. doi: 10.1093/gigascience/gix007.

Comparison of different assembly and annotation tools on analysis of simulated viral metagenomic communities in the gut.

BMC Genomics. 2014 Jan 18;15:37. doi: 10.1186/1471-2164-15-37.

Cited by

RpNGS: an automated platform for pathogen identification and monitoring in clinical metagenomics data.

PeerJ. 2025 Aug 12;13:e19849. doi: 10.7717/peerj.19849. eCollection 2025.

The wound microbiome associated with deep sternal wound infection: a scoping review.

J Thorac Dis. 2025 Jul 31;17(7):5330-5346. doi: 10.21037/jtd-24-1648. Epub 2025 Jul 29.

Testing the limits of short-reads metagenomic classifications programs in wastewater treating microbial communities.

Sci Rep. 2025 Jul 5;15(1):23997. doi: 10.1038/s41598-025-07734-8.

Incomplete human reference genomes can drive false sex biases and expose patient-identifying information in metagenomic data.

Nat Commun. 2025 Jan 18;16(1):825. doi: 10.1038/s41467-025-56077-5.

Incomplete human reference genomes can drive false sex biases and expose patient-identifying information in metagenomic data.

Res Sq. 2024 Oct 23:rs.3.rs-4721159. doi: 10.21203/rs.3.rs-4721159/v1.

Deep longitudinal lower respiratory tract microbiome profiling reveals genome-resolved functional and evolutionary dynamics in critical illness.

Nat Commun. 2024 Sep 27;15(1):8361. doi: 10.1038/s41467-024-52713-8.

Clinical Metagenomic Next-Generation Sequencing for Diagnosis of Central Nervous System Infections: Advances and Challenges.

Mol Diagn Ther. 2024 Sep;28(5):513-523. doi: 10.1007/s40291-024-00727-9. Epub 2024 Jul 11.

SWGTS-a platform for stream-based host DNA depletion.

Bioinformatics. 2024 Jun 3;40(6). doi: 10.1093/bioinformatics/btae332.

Ten common issues with reference sequence databases and how to mitigate them.

Front Bioinform. 2024 Mar 15;4:1278228. doi: 10.3389/fbinf.2024.1278228. eCollection 2024.

Modeling the limits of detection for antimicrobial resistance genes in agri-food samples: a comparative analysis of bioinformatics tools.

BMC Microbiol. 2024 Jan 20;24(1):31. doi: 10.1186/s12866-023-03148-6.
