Ni Guiyan, Strom Tim M, Pausch Hubert, Reimer Christian, Preisinger Rudolf, Simianer Henner, Erbe Malena
Animal Breeding and Genetics Group, Georg-August-Universität, Göttingen, Germany.
Institute of Human Genetics, Helmholtz Zentrum München, Neuherberg, Germany.
BMC Genomics. 2015 Oct 21;16:824. doi: 10.1186/s12864-015-2059-2.
The technical progress in the last decade has made it possible to sequence millions of DNA reads in a relatively short time frame. Several variant callers based on different algorithms have emerged and have made it possible to extract single nucleotide polymorphisms (SNPs) out of the whole-genome sequence. Often, only a few individuals of a population are sequenced completely and imputation is used to obtain genotypes for all sequence-based SNP loci for other individuals, which have been genotyped for a subset of SNPs using a genotyping array.
First, we compared the sets of variants detected with different variant callers, namely GATK, freebayes and SAMtools, and checked the quality of genotypes of the called variants in a set of 50 fully sequenced white and brown layers. Second, we assessed the imputation accuracy (measured as the correlation between imputed and true genotype per SNP and per individual, and genotype conflict between father-progeny pairs) when imputing from high density SNP array data to whole-genome sequence using data from around 1000 individuals from six different generations. Three different imputation programs (Minimac, FImpute and IMPUTE2) were checked in different validation scenarios.
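The accuracy measures named above can be illustrated with a minimal sketch. It assumes genotypes are coded as allele counts (0, 1, 2) in an individuals-by-SNPs list of lists, and that a father-progeny conflict means opposing homozygotes (0 versus 2); neither the coding nor this conflict definition is spelled out in the abstract, and all function names are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences; None if undefined."""
    n = len(x)
    if n != len(y) or n < 2:
        return None
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        # A monomorphic row or column has no variance, so r is undefined.
        return None
    return sxy / sqrt(sxx * syy)


def per_individual_correlation(true, imputed):
    """One correlation per individual (row), taken across all SNPs."""
    return [pearson(t, i) for t, i in zip(true, imputed)]


def per_snp_correlation(true, imputed):
    """One correlation per SNP (column), taken across all individuals."""
    return [pearson(t, i) for t, i in zip(zip(*true), zip(*imputed))]


def opposing_homozygote_rate(father, progeny):
    """Share of SNPs where father and progeny carry opposite homozygotes.

    Assumption: a conflict is the pair (0, 2) or (2, 0) at the same SNP.
    """
    conflicts = sum(1 for f, p in zip(father, progeny) if {f, p} == {0, 2})
    return conflicts / len(father)
```

Returning `None` for SNPs without variance keeps undefined correlations out of any average, which matters most for SNPs with low minor allele frequency.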
On the studied chromosomes 3, 6 and 28, all three callers detected 1,741,573 SNPs in common, corresponding to 71.6 % (81.6 %, 88.0 %) of all SNPs detected by GATK (SAMtools, freebayes). Genotype concordance (GC), defined as the proportion of individuals whose array-derived genotypes are the same as the sequence-derived genotypes over all non-missing SNPs on the array, was 0.98 (GATK), 0.97 (freebayes) and 0.98 (SAMtools). Furthermore, the percentage of variants with high values (>0.9) for three further measures (non-reference sensitivity, non-reference genotype concordance and precision) was 90 % (88 %, 75 %) for GATK (SAMtools, freebayes). With all imputation programs, the correlation between original and imputed genotypes was >0.95 on average for 1000 randomly masked SNPs from the SNP array and >0.85 for a leave-one-out cross-validation within sequenced individuals.
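As a rough illustration of two quantities reported above, the following sketch computes a genotype concordance and the share of each caller's SNPs that all callers found. It simplifies the concordance to the proportion of matching array and sequence genotypes over all non-missing pairs, and it assumes each call set is a Python set of site keys; the function names and toy inputs are hypothetical and do not reproduce the study's data.

```python
def genotype_concordance(array_gt, seq_gt, missing=None):
    """Proportion of non-missing genotype pairs that agree.

    array_gt and seq_gt are individuals-by-SNPs lists of lists holding
    array-derived and sequence-derived genotypes for the same SNPs.
    """
    compared = 0
    matches = 0
    for a_row, s_row in zip(array_gt, seq_gt):
        for a, s in zip(a_row, s_row):
            if a == missing or s == missing:
                continue  # skip pairs where either genotype is missing
            compared += 1
            if a == s:
                matches += 1
    return matches / compared if compared else None


def shared_fraction(callsets):
    """For each caller, the fraction of its sites found by every caller.

    callsets maps a caller name to a set of site keys, for example
    (chromosome, position) tuples.
    """
    shared = set.intersection(*callsets.values())
    return {
        name: len(shared) / len(sites)
        for name, sites in callsets.items()
        if sites
    }
```

For example, `shared_fraction({"A": {1, 2, 3, 4}, "B": {2, 3, 4, 5, 6}, "C": {3, 4}})` gives 0.5, 0.4 and 1.0 for the three toy call sets, mirroring how one shared SNP count translates into a different percentage for each caller.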
All variant callers studied performed very well in general, GATK and SAMtools in particular. FImpute performed slightly worse than Minimac and IMPUTE2 in terms of genotype correlation, especially for SNPs with low minor allele frequency, but it had the lowest number of Mendelian conflicts in the available father-progeny pairs. Correlations between true and imputed genotypes remained consistently high even when the individuals to be imputed were several generations away from the sequenced individuals.