Zebra: Static and Dynamic Genome Cover Thresholds with Overlapping References.

Author information

Department of Pediatrics, School of Medicine, University of California, San Diego, La Jolla, California, USA.

Bioinformatics and Systems Biology Program, University of California, San Diego, La Jolla, California, USA.

Publication information

mSystems. 2022 Oct 26;7(5):e0075822. doi: 10.1128/msystems.00758-22. Epub 2022 Sep 8.

Abstract

Assigning taxonomy remains a challenging topic in microbiome studies, due largely to ambiguity of reads which overlap multiple reference genomes. With the Web of Life (WoL) reference database hosting 10,575 reference genomes and growing, the percentage of ambiguous reads will only increase. The resulting artifacts create both the illusion of co-occurrence and a long tail end of extraneous reference hits that confound interpretation. We introduce genome cover, the fraction of reference genome overlapped by reads, to distinguish these artifacts. We show how to dynamically predict genome cover by read count and examine our model in Staphylococcus aureus monoculture. Our modeling cleanly separates both S. aureus and true contaminants from the false artifacts of reference overlap. We next introduce saturated genome cover, the true fraction of a reference genome overlapped by sample contents. Genome cover may not saturate for low abundance or low prevalence bacteria. We assuage this worry with examination of a large human fecal data set. By compositing the metric across like samples, genome cover saturates even for rare species. We note that it is a threshold on saturated genome cover, not genome cover itself, which indicates a spurious reference hit or distant relative. We present Zebra, a method to compute and threshold the genome cover metric across like samples, a recurrence to estimate genome cover and confirm saturation, and provide guidance for choosing cover thresholds in real world scenarios. Standalone genome cover and integration into Woltka are available: https://github.com/biocore/zebra_filter, https://github.com/qiyunzhu/woltka.

Taxonomic assignment, assigning sequences to specific taxonomic units, is a crucial processing step in microbiome analyses. Issues in taxonomic assignment affect interpretation of what microbes are present in each sample and may be associated with specific environmental or clinical conditions. Assigning importance to a particular taxon relies strongly on independence of assigned counts. The false inclusion of thousands of correlated taxa makes interpretation ambiguous, leading to underconstrained results which cannot be reproduced. The importance sometimes attached to implausible artifacts such as anthrax or bubonic plague is especially problematic. We show that the Zebra filter retrieves only the nearest relatives of sample contents, enabling more reproducible and biologically plausible interpretation of metagenomic data.
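
The genome cover metric described above (the fraction of a reference genome overlapped by reads, composited across like samples and then thresholded) can be illustrated with a minimal sketch. This is not the Zebra or Woltka implementation: the function names, the half-open interval representation of read alignments, the toy data, and the 0.1 threshold are all assumptions made for illustration.

```python
from collections import defaultdict


def merge_intervals(intervals):
    """Merge overlapping or touching half-open [start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Extend the previous interval instead of starting a new one.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [(s, e) for s, e in merged]


def genome_cover(intervals, genome_length):
    """Fraction of the reference genome overlapped by at least one read."""
    covered = sum(e - s for s, e in merge_intervals(intervals))
    return covered / genome_length


def composite_cover(samples, genome_lengths):
    """Pool alignment intervals across like samples, then compute cover."""
    pooled = defaultdict(list)
    for sample in samples:
        for genome_id, intervals in sample.items():
            pooled[genome_id].extend(intervals)
    return {
        genome_id: genome_cover(intervals, genome_lengths[genome_id])
        for genome_id, intervals in pooled.items()
    }


def filter_by_cover(cover, threshold):
    """Keep reference genomes whose composite cover meets the threshold."""
    return {genome_id for genome_id, c in cover.items() if c >= threshold}


# Hypothetical toy data: two reference genomes of 1,000 bases each.
genome_lengths = {"G1": 1000, "G2": 1000}
sample_a = {"G1": [(0, 150), (100, 400)], "G2": [(500, 520)]}
sample_b = {"G1": [(350, 900)], "G2": [(505, 530)]}

cover = composite_cover([sample_a, sample_b], genome_lengths)
kept = filter_by_cover(cover, threshold=0.1)  # threshold chosen arbitrarily
print(cover, kept)
```

In this toy case, reads hitting G2 pile up on one short region, so its composite cover stays low however many samples are pooled, while G1 approaches full cover. An appropriate threshold depends on the data set, as the abstract notes.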

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/3bc6/9600373/e3eb52a2fc7b/msystems.00758-22-f001.jpg
