Morden Research and Development Centre, Agriculture and Agri-Food Canada, 101 Route 100, Morden, Manitoba, R6M 1Y5, Canada.
Ottawa Research and Development Centre, Agriculture and Agri-Food Canada, 960 Carling Avenue, Ottawa, Ontario, K1A 0C6, Canada.
BMC Bioinformatics. 2020 Aug 17;21(1):360. doi: 10.1186/s12859-020-03704-1.
Discovering single nucleotide polymorphisms (SNPs) from agricultural crop genome sequences is a widely used strategy for developing genetic markers for applications including marker-assisted breeding, population diversity studies of eco-geographical adaptation, and genotyping of crop germplasm collections. Accurately detecting SNPs in large polyploid crop genomes such as wheat is both crucial and challenging. Several variant calling methods have been developed, but their variant calls show low concordance with one another. A gold-standard variant set generated from a single human individual has been established for evaluating variant calling tools, but no comparable gold-standard variant set is yet available for wheat. This study evaluated seven SNP variant calling tools (FreeBayes, GATK, Platypus, Samtools/mpileup, SNVer, VarScan, VarDict) in combination with the two most popular mapping tools (BWA-mem and Bowtie2) on whole exome capture (WEC) re-sequencing data from allohexaploid wheat.
We found that BWA-mem had both a higher mapping rate and a higher accuracy rate than Bowtie2. At the same mapping quality (MQ) cutoff, BWA-mem detected more variant bases in mapped reads than Bowtie2. Preprocessing the reads by quality trimming or duplicate removal did not significantly affect final mapping performance in terms of mapped reads. Based on concordance and receiver operating characteristic (ROC) analysis, Samtools/mpileup applied to BWA-mem mappings of raw sequence reads outperformed the other combinations tested in terms of specificity and sensitivity, followed by FreeBayes and GATK. VarDict and VarScan were the poorest-performing variant calling tools on the wheat WEC sequence data.
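To make the evaluation criteria concrete, the following Python sketch shows one way concordance between call sets, and sensitivity and specificity against a reference set, might be computed. The abstract does not say how concordance was defined or which reference set underlies the ROC analysis, so the Jaccard overlap, the `truth` set, the `candidates` universe, and the toy SNP tuples are all assumptions made for illustration only.

```python
from itertools import combinations


def pairwise_concordance(calls):
    """Jaccard overlap (shared calls / all calls) for each pair of callers.

    Assumption: concordance is measured as Jaccard overlap; the study may
    have used a different definition.
    """
    result = {}
    for a, b in combinations(sorted(calls), 2):
        union = calls[a] | calls[b]
        shared = calls[a] & calls[b]
        result[(a, b)] = len(shared) / len(union) if union else 1.0
    return result


def sensitivity_specificity(called, truth, candidates):
    """Sensitivity and specificity of one call set.

    `truth` is a hypothetical reference set of real SNPs, and `candidates`
    is the universe of sites considered (it must contain both other sets).
    """
    tp = len(called & truth)               # real SNPs that were called
    fn = len(truth - called)               # real SNPs that were missed
    fp = len(called - truth)               # calls that are not real SNPs
    tn = len(candidates - called - truth)  # non-SNP sites left uncalled
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return sensitivity, specificity


# Toy data: each SNP is (chromosome, position, alternate allele).
calls = {
    "caller_a": {("1A", 100, "T"), ("1A", 250, "G"), ("2B", 40, "C")},
    "caller_b": {("1A", 100, "T"), ("2B", 40, "C"), ("3D", 77, "A")},
}
truth = {("1A", 100, "T"), ("2B", 40, "C")}
candidates = calls["caller_a"] | calls["caller_b"] | {("4A", 5, "G")}

concordance = pairwise_concordance(calls)
sens_a, spec_a = sensitivity_specificity(calls["caller_a"], truth, candidates)
```

Repeating the sensitivity and specificity calculation across a range of quality thresholds yields the points of an ROC curve for each caller.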
The BWA-mem and Samtools/mpileup pipeline, which requires no preprocessing of the raw reads before mapping to the reference genome, was found to be optimal for SNP calling in re-sequencing of the complex wheat genome. These results also provide useful guidelines for reliable variant identification from deep sequencing of other large polyploid crop genomes.
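As a rough illustration of what such a pipeline can look like, the Python sketch below only assembles the command lines and does not run them. The abstract gives neither the exact commands nor the tool versions used, so everything here is an assumption: the file names, the `min_mapq=20` cutoff, the thread count, and the use of `bcftools mpileup` with `bcftools call`, which is how pileup-based calling is done in current releases. The reference is also assumed to have been indexed beforehand with `bwa index`.

```python
import shlex


def build_pipeline(reference, reads_1, reads_2, prefix, min_mapq=20, threads=4):
    """Return the pipeline as stages; commands within a stage are piped.

    All parameter values are illustrative, not the settings of the study.
    """
    bam = f"{prefix}.sorted.bam"
    vcf = f"{prefix}.vcf"
    # Map raw paired-end reads (no trimming or duplicate removal) and sort.
    map_cmd = ["bwa", "mem", "-t", str(threads), reference, reads_1, reads_2]
    sort_cmd = ["samtools", "sort", "-o", bam, "-"]
    # Index the sorted alignments.
    index_cmd = ["samtools", "index", bam]
    # Pile up reads above the mapping quality cutoff, then call variants.
    pileup_cmd = ["bcftools", "mpileup", "-f", reference,
                  "-q", str(min_mapq), bam]
    call_cmd = ["bcftools", "call", "-m", "-v", "-o", vcf]
    return [(map_cmd, sort_cmd), (index_cmd,), (pileup_cmd, call_cmd)]


def render(stages):
    """Render each stage as a single shell line, joining commands with pipes."""
    return [" | ".join(shlex.join(cmd) for cmd in stage) for stage in stages]


# Hypothetical input files, for illustration only.
lines = render(build_pipeline("ref.fa", "r1.fq", "r2.fq", "sample"))
for line in lines:
    print(line)
```

Building the commands as argument lists keeps file names safely quoted and makes it easy to swap in another caller for comparison.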