A Benchmark of Genetic Variant Calling Pipelines Using Metagenomic Short-Read Sequencing

Author information

Andreu-Sánchez Sergio, Chen Lianmin, Wang Daoming, Augustijn Hannah E, Zhernakova Alexandra, Fu Jingyuan

Affiliations

Department of Genetics, University of Groningen and University Medical Center Groningen, Groningen, Netherlands.

Department of Pediatrics, University of Groningen and University Medical Center Groningen, Groningen, Netherlands.

Publication information

Front Genet. 2021 May 10;12:648229. doi: 10.3389/fgene.2021.648229. eCollection 2021.

DOI:10.3389/fgene.2021.648229

PMID:34040632

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC8141913/

Abstract

Microbes live in complex communities that are of major importance for environmental ecology, public health, and animal physiology and pathology. Short-read metagenomic shotgun sequencing is currently the state-of-the-art technique for exploring these communities. With the aid of metagenomics, our understanding of the microbiome is moving from composition toward functionality, even down to the genetic variant level. While the exploration of single-nucleotide variation in a genome is a standard procedure in genomics, and many sophisticated tools exist to perform this task, identification of genetic variation in metagenomes remains challenging. Major factors that hamper the widespread application of variant-calling analysis include low-depth sequencing of individual genomes (which is especially significant for the microorganisms present in low abundance), the existence of large genomic variation even within the same species, the absence of comprehensive reference genomes, and the noise introduced by next-generation sequencing errors. Some bioinformatics tools, such as metaSNV or InStrain, have been created to identify genetic variants in metagenomes, but the performance of these tools has not been systematically assessed or compared with the variant callers commonly used on single or pooled genomes. In this study, we benchmark seven bioinformatics tools for genetic variant calling in metagenomic data and assess their performance. To do so, we simulated metagenomic reads to mimic human microbial composition, sequencing errors, and genetic variability. We also simulated different conditions, including low and high depth of coverage and a single strain or multiple strains per species. Our analysis of the simulated data shows that probabilistic tools such as HaplotypeCaller and Mutect2 from the GATK toolset perform best.
By applying these tools to longitudinal gut microbiome data from the Human Microbiome Project, we show that the genetic similarity between longitudinal samples from the same individuals is significantly greater than the similarity between samples from different individuals. Our benchmark shows that probabilistic tools can be used to call variants in metagenomes, and we recommend GATK's tools as reliable variant callers for metagenomic samples.
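
As an illustration of how a variant caller can be scored against simulated data, the following is a minimal sketch, not the authors' evaluation code. It assumes that the simulated (true) variants and the called variants are each available as a set of (contig, position, reference allele, alternate allele) tuples, and it reports precision, recall, and F1. The abstract does not state which metrics the study used, so these metrics, the function name `score_calls`, and the toy data are all assumptions made for illustration.

```python
from typing import NamedTuple, Set, Tuple

# A variant is identified by (contig, 1-based position, ref allele, alt allele).
# This representation is an assumption made for the sketch.
Variant = Tuple[str, int, str, str]


class Scores(NamedTuple):
    true_positives: int
    false_positives: int
    false_negatives: int
    precision: float
    recall: float
    f1: float


def score_calls(truth: Set[Variant], called: Set[Variant]) -> Scores:
    """Compare called variants with the simulated truth set (hypothetical helper)."""
    tp = len(truth & called)   # simulated variants that were called
    fp = len(called - truth)   # calls that were never simulated
    fn = len(truth - called)   # simulated variants that were missed
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return Scores(tp, fp, fn, precision, recall, f1)


# Toy data, invented for illustration only.
truth = {
    ("contig_1", 101, "A", "G"),
    ("contig_1", 250, "C", "T"),
    ("contig_2", 42, "G", "A"),
    ("contig_2", 77, "T", "C"),
}
called = {
    ("contig_1", 101, "A", "G"),
    ("contig_1", 250, "C", "T"),
    ("contig_2", 42, "G", "A"),
    ("contig_2", 90, "T", "C"),  # false positive: not in the truth set
}

scores = score_calls(truth, called)
print(scores)
```

In practice, the two sets would be parsed from the simulator's truth file and from each caller's VCF output, and the comparison would be repeated for each simulated condition (for example, low versus high depth of coverage).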
