Evaluation of methods for detecting human reads in microbial sequencing datasets.

Affiliations

Nuffield Department of Medicine, University of Oxford, Oxford, UK.

Organisms and Environment Division, School of Biosciences, Cardiff University, Cardiff, Wales, UK.

Publication information

Microb Genom. 2020 Jul;6(7). doi: 10.1099/mgen.0.000393.

Abstract

Sequencing data from host-associated microbes are often contaminated with human DNA from the investigator or research subject. Human DNA is typically removed from microbial reads either by subtractive alignment (dropping all reads that map to the human genome) or by using a read classification tool to predict those of human origin, and then discarding them. To inform best practice guidelines, we benchmarked eight alignment-based and two classification-based methods of human read detection using simulated data from 10 clinically prevalent bacteria and three viruses, to which contaminating human reads had been added. While the majority of methods successfully detected >99 % of the human reads, they were distinguishable by variance. The most precise methods, with negligible variance, were Bowtie2 and SNAP, both of which misidentified few, if any, bacterial reads (and no viral reads) as human. While correctly detecting a similar number of human reads, methods based on taxonomic classification, such as Kraken2 and Centrifuge, could misclassify bacterial reads as human, although the extent of this was species-specific. Among the most sensitive methods of human read detection was BWA, although this also made the greatest number of false positive classifications. Across all methods, the human reads not identified as such, although often representing <0.1 % of the total reads, were non-randomly distributed along the human genome, with many originating from the repeat-rich sex chromosomes. For viral reads and longer (>300 bp) bacterial reads, the highest performing approaches were classification-based, using Kraken2 or Centrifuge. For shorter (c. 150 bp) bacterial reads, combining multiple methods of human read detection maximized the recovery of human reads from contaminated short read datasets without being compromised by false positives. A particularly high-performance approach with shorter bacterial reads was a two-stage classification using Bowtie2 followed by SNAP. Using this approach, we re-examined 11 577 publicly archived bacterial read sets for hitherto undetected human contamination. We were able to extract a sufficient number of reads to call known human SNPs, including those with clinical significance, in 6 % of the samples. These results show that phenotypically distinct human sequence is detectable in publicly archived microbial read datasets.
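
The two-stage idea and the way such methods are scored can be illustrated with a minimal Python sketch. It is not the authors' pipeline: it works only on read identifiers, and it assumes the second stage is run on the reads that the first stage retained. The names `two_stage_filter`, `sensitivity_precision` and `lookup_detector` are hypothetical, and the toy detectors merely stand in for real tools such as Bowtie2 and SNAP, whose command lines are deliberately not reproduced here.

```python
from typing import Callable, Iterable, List, Set, Tuple

# A detector takes read IDs and returns the subset it predicts to be human.
Detector = Callable[[Iterable[str]], Set[str]]


def two_stage_filter(
    read_ids: Iterable[str], first: Detector, second: Detector
) -> Tuple[List[str], Set[str]]:
    """Apply two detectors in sequence (assumption: stage 2 sees only
    the reads that stage 1 did not flag).

    Returns (retained_reads, flagged_as_human).
    """
    reads = list(read_ids)
    flagged_first = first(reads)
    remaining = [r for r in reads if r not in flagged_first]
    flagged_second = second(remaining)
    retained = [r for r in remaining if r not in flagged_second]
    return retained, flagged_first | flagged_second


def sensitivity_precision(
    flagged: Set[str], true_human: Set[str]
) -> Tuple[float, float]:
    """Score predictions against the known origin of simulated reads."""
    tp = len(flagged & true_human)   # human reads correctly flagged
    fp = len(flagged - true_human)   # microbial reads wrongly flagged
    fn = len(true_human - flagged)   # human reads that were missed
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return sensitivity, precision


def lookup_detector(known_hits: Set[str]) -> Detector:
    """Hypothetical stand-in for a real tool: flags reads in a fixed set."""
    def detect(reads: Iterable[str]) -> Set[str]:
        return {r for r in reads if r in known_hits}
    return detect


if __name__ == "__main__":
    # Toy data: three simulated human reads and two bacterial reads.
    reads = ["h1", "h2", "h3", "b1", "b2"]
    truth = {"h1", "h2", "h3"}
    stage_one = lookup_detector({"h1", "h2"})   # misses h3
    stage_two = lookup_detector({"h3", "b1"})   # recovers h3, one false positive
    retained, flagged = two_stage_filter(reads, stage_one, stage_two)
    print(retained, sorted(flagged))
    print(sensitivity_precision(flagged, truth))
```

In this toy case the second stage recovers the human read missed by the first, at the cost of one bacterial read wrongly flagged. This is the trade-off between sensitivity and false positives that the benchmark measures.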

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/383d/7478626/df9f863ca25c/mgen-6-393-g001.jpg
