Stephens Matthew, Sloan James S, Robertson P D, Scheet Paul, Nickerson Deborah A
Department of Statistics, University of Washington, Seattle, Washington 98195, USA.
Nat Genet. 2006 Mar;38(3):375-81. doi: 10.1038/ng1746. Epub 2006 Feb 19.
The detection of sequence variation, for which DNA sequencing has emerged as the most sensitive and automated approach, forms the basis of all genetic analysis. Here we describe and illustrate an algorithm that accurately detects and genotypes SNPs from fluorescence-based sequence data. Because the algorithm focuses particularly on detecting SNPs through the identification of heterozygous individuals, it is especially well suited to the detection of SNPs in diploid samples obtained after DNA amplification. It is substantially more accurate than existing approaches and, notably, provides a useful quantitative measure of its confidence in each potential SNP detected and in each genotype called. Calls assigned the highest confidence are sufficiently reliable to remove the need for manual review in several contexts. For example, for sequence data from 47-90 individuals sequenced on both the forward and reverse strands, the highest-confidence calls from our algorithm detected 93% of all SNPs and 100% of high-frequency SNPs, with no false positive SNPs identified and 99.9% genotyping accuracy. This algorithm is implemented in a software package, PolyPhred version 5.0, which is freely available for academic use.
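The abstract describes accepting the highest-confidence calls without manual review and reports detection rate and genotyping accuracy for them. The following is a minimal sketch of that triage-and-evaluate workflow, not PolyPhred's implementation: the `Call` record, the 0-to-1 `score` scale, the `accept_at` threshold, and all example data are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Call:
    site: int       # position of the candidate SNP in the reference sequence
    sample: str     # individual identifier
    genotype: str   # called genotype, e.g. "A/G"
    score: float    # hypothetical confidence in [0, 1]; not PolyPhred's actual scale


def triage(calls, accept_at=0.99):
    """Split calls into auto-accepted and needs-manual-review lists."""
    accepted = [c for c in calls if c.score >= accept_at]
    review = [c for c in calls if c.score < accept_at]
    return accepted, review


def detection_rate(detected_sites, true_sites):
    """Fraction of known SNP sites that appear among the detected sites."""
    true_sites = set(true_sites)
    if not true_sites:
        raise ValueError("true_sites must not be empty")
    return len(set(detected_sites) & true_sites) / len(true_sites)


def genotype_accuracy(calls, truth):
    """Fraction of calls whose genotype matches a validated genotype.

    `truth` maps (site, sample) to the validated genotype; calls without
    a validated genotype are ignored.
    """
    scored = [c for c in calls if (c.site, c.sample) in truth]
    if not scored:
        raise ValueError("no calls overlap the validated genotypes")
    correct = sum(c.genotype == truth[(c.site, c.sample)] for c in scored)
    return correct / len(scored)


# Made-up example data, for illustration only.
calls = [
    Call(101, "s1", "A/G", 0.999),
    Call(101, "s2", "A/A", 0.995),
    Call(250, "s1", "C/T", 0.62),
]
accepted, review = triage(calls)
rate = detection_rate({c.site for c in accepted}, {101, 250})
accuracy = genotype_accuracy(accepted, {(101, "s1"): "A/G", (101, "s2"): "A/A"})
```

In this sketch, raising `accept_at` trades detection rate for fewer false positives among the auto-accepted calls; calls below the threshold are kept for manual review rather than discarded.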