Chan Zuckerberg Biohub, San Francisco, CA, USA; Gladstone Institute of Data Science and Biotechnology, San Francisco, CA, USA.
Chan Zuckerberg Biohub, San Francisco, CA, USA; Gladstone Institute of Data Science and Biotechnology, San Francisco, CA, USA; Department of Epidemiology & Biostatistics, University of California, San Francisco, San Francisco, CA, USA.
Cell Syst. 2023 Feb 15;14(2):160-176.e3. doi: 10.1016/j.cels.2022.12.007. Epub 2023 Jan 18.
Detecting genetic variants in metagenomic data is a priority for understanding the evolution, ecology, and functional characteristics of microbial communities. Many tools that perform this metagenotyping rely on aligning reads of unknown origin to a database of sequences from many species before calling variants. In this synthesis, we investigate how databases of increasingly diverse and closely related species have pushed the limits of current alignment algorithms, thereby degrading the performance of metagenotyping tools. We identify multi-mapping reads as a prevalent source of errors and illustrate a trade-off between retaining correct alignments and limiting incorrect alignments, many of which map reads to the wrong species. We then evaluate several actionable mitigation strategies and review emerging methods that show promise for further improving metagenotyping in response to the rapid growth in genome collections. Our results have implications beyond metagenotyping for the many tools in microbial genomics that depend upon accurate read mapping.
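The trade-off described in the abstract can be made concrete with a small sketch. It assumes a mapping-quality (MAPQ) cutoff as the alignment filter, which is one common choice and not necessarily the strategy evaluated in the article. The `Alignment` record, the `summarize` helper, and the toy reads are all hypothetical and invented for illustration; the true species of a read would only be known in a simulation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alignment:
    read_id: str
    true_species: str    # ground truth, available only for simulated reads
    mapped_species: str  # species of the reference sequence the aligner chose
    mapq: int            # mapping quality; low values often accompany multi-mapping reads


def summarize(alignments, min_mapq):
    """Count correct and incorrect alignments retained at a MAPQ cutoff."""
    kept = [a for a in alignments if a.mapq >= min_mapq]
    correct = sum(a.true_species == a.mapped_species for a in kept)
    return {"min_mapq": min_mapq, "correct": correct, "incorrect": len(kept) - correct}


# Made-up toy data (hypothetical, not taken from the article).
toy = [
    Alignment("r1", "A", "A", 42),
    Alignment("r2", "A", "A", 1),   # correct, but low MAPQ
    Alignment("r3", "A", "B", 3),   # mapped to the wrong species
    Alignment("r4", "B", "B", 30),
    Alignment("r5", "B", "A", 0),   # mapped to the wrong species
    Alignment("r6", "B", "B", 2),   # correct, but low MAPQ
]

# Raising the cutoff removes wrong-species alignments but also discards correct ones.
results = [summarize(toy, q) for q in (0, 2, 10)]
for row in results:
    print(row)
```

On this toy input, each stricter cutoff removes one more incorrect alignment at the cost of one more correct alignment, which is the kind of tension between sensitivity and precision that the abstract refers to.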