Podder Mohua, Welch William J, Zamar Ruben H, Tebbutt Scott J
Department of Statistics, University of British Columbia, Vancouver, BC, Canada.
BMC Bioinformatics. 2006 Nov 30;7:521. doi: 10.1186/1471-2105-7-521.
Single nucleotide polymorphisms (SNPs) are DNA sequence variations that occur when a single nucleotide, adenine (A), thymine (T), cytosine (C) or guanine (G), is altered. Arguably, SNPs account for more than 90% of human genetic variation. Our laboratory has developed a highly redundant SNP genotyping assay, based on arrayed primer extension (APEX), in which each SNP is interrogated by multiple probes with signals from multiple channels. This mini-sequencing method combines a highly parallel microarray with distinctive Sanger-based dideoxy terminator sequencing chemistry. On this microarray platform, our current genotype calling system (known as SNP Chart) calls individual SNP genotypes by manual inspection of the APEX data, which is time-consuming and prone to user subjectivity bias.
Using a set of 32 Coriell DNA samples plus three negative PCR controls as a training data set, we have developed a fully automated genotyping algorithm based on simple linear discriminant analysis (LDA) with dynamic variable selection. The algorithm combines separate analyses of the multiple probe sets to give a final posterior probability for each candidate genotype. We tested the algorithm on a completely independent data set of 270 DNA samples with validated genotypes, from patients admitted to the intensive care unit (ICU) of St. Paul's Hospital, plus one negative PCR control sample. For a set of 96 SNPs, the method achieves a concordance rate of 98.9% with a call rate of 99.6%. Raising the threshold on the final posterior probability of the called genotype lowers the call rate to 94.9% and raises the concordance rate to 99.6%. We also reversed the training and testing roles of the two independent data sets, achieving a concordance rate of up to 99.8%.
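The calling scheme described above (one LDA per probe set, a combined posterior, and a threshold that trades call rate for concordance) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration rather than the published implementation: each probe set is reduced to a single hypothetical feature (for example, a log-ratio of the two allele signals), genotypes are labelled AA/AB/BB, the per-probe-set posteriors are combined by a simple product rule, the threshold of 0.9 is arbitrary, the training values are toy numbers, and dynamic variable selection is omitted.

```python
import math

def fit_lda_1d(values, labels):
    """Fit a one-feature LDA: class means, a pooled variance, and class priors."""
    groups = {}
    for v, g in zip(values, labels):
        groups.setdefault(g, []).append(v)
    means = {g: sum(vs) / len(vs) for g, vs in groups.items()}
    ss = sum((v - means[g]) ** 2 for v, g in zip(values, labels))
    pooled_var = ss / (len(values) - len(groups))
    priors = {g: len(vs) / len(values) for g, vs in groups.items()}
    return means, pooled_var, priors

def lda_posterior(model, x):
    """Posterior probability of each genotype for one feature value."""
    means, var, priors = model
    log_scores = {g: math.log(priors[g]) - (x - means[g]) ** 2 / (2 * var)
                  for g in means}
    top = max(log_scores.values())  # subtract the max for numerical stability
    unnorm = {g: math.exp(s - top) for g, s in log_scores.items()}
    total = sum(unnorm.values())
    return {g: u / total for g, u in unnorm.items()}

def combine_posteriors(posteriors):
    """Multiply per-probe-set posteriors and renormalise (assumed product rule)."""
    combined = {g: 1.0 for g in posteriors[0]}
    for p in posteriors:
        for g in combined:
            combined[g] *= p[g]
    total = sum(combined.values())
    return {g: v / total for g, v in combined.items()}

def call_genotype(posterior, threshold):
    """Return the most probable genotype, or None (no call) below the threshold."""
    best = max(posterior, key=posterior.get)
    return best if posterior[best] >= threshold else None

# Toy training data (hypothetical): one feature per probe set, nine samples.
train_labels = ["AA"] * 3 + ["AB"] * 3 + ["BB"] * 3
probe_set_1 = [2.0, 2.2, 1.8, 0.1, -0.1, 0.0, -2.0, -1.9, -2.1]
probe_set_2 = [1.8, 2.2, 2.0, 0.2, 0.0, -0.2, -1.8, -2.2, -2.0]
models = [fit_lda_1d(probe_set_1, train_labels),
          fit_lda_1d(probe_set_2, train_labels)]

def genotype_sample(features, threshold=0.9):
    """Call one sample from its per-probe-set feature values."""
    posts = [lda_posterior(m, x) for m, x in zip(models, features)]
    return call_genotype(combine_posteriors(posts), threshold)

print(genotype_sample([1.9, 2.0]))  # clearly inside the AA cluster
print(genotype_sample([1.0, 1.0]))  # midway between AA and AB
```

In this sketch, a sample lying midway between two genotype clusters has a best posterior near 0.5 and is left uncalled. Raising the threshold therefore removes the least certain calls, which is the mechanism by which a lower call rate can accompany a higher concordance rate.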
The strength of this APEX chemistry-based platform is its redundancy: each SNP is interrogated by multiple probes. Our model-based genotype calling algorithm exploits this redundancy by considering all the underlying probe features of a particular SNP and automatically down-weighting any 'bad data' caused by image artifacts on the microarray slide or by failure of a specific chemistry. The method thus automatically selects the probes that work well and reduces the influence of poorly performing probes in a sample-specific manner, for any number of SNPs.
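The abstract does not say how the down-weighting is computed, so the following is only one hypothetical way to picture it: weighted log-linear pooling, in which each probe set's posterior is raised to a per-sample weight before the posteriors are multiplied and renormalised. The function name `pool_posteriors`, the weights, and the posterior values are all invented for illustration.

```python
import math

def pool_posteriors(posteriors, weights):
    """Weighted log-linear pooling: p(g) is proportional to prod_k p_k(g) ** w_k."""
    eps = 1e-12  # guards against log(0)
    log_pool = {
        g: sum(w * math.log(max(p[g], eps)) for p, w in zip(posteriors, weights))
        for g in posteriors[0]
    }
    top = max(log_pool.values())  # subtract the max for numerical stability
    unnorm = {g: math.exp(v - top) for g, v in log_pool.items()}
    total = sum(unnorm.values())
    return {g: u / total for g, u in unnorm.items()}

# Hypothetical posteriors for one sample from three probe sets of the same SNP.
# The third probe set disagrees strongly, as it might after an image artifact.
probe_sets = [
    {"AA": 0.90, "AB": 0.08, "BB": 0.02},
    {"AA": 0.85, "AB": 0.10, "BB": 0.05},
    {"AA": 0.001, "AB": 0.009, "BB": 0.99},
]

equal = pool_posteriors(probe_sets, [1.0, 1.0, 1.0])
down_weighted = pool_posteriors(probe_sets, [1.0, 1.0, 0.1])

print(max(equal, key=equal.get))
print(max(down_weighted, key=down_weighted.get))
```

With equal weights the single aberrant probe set dominates the pooled result. Reducing its weight for this sample lets the two agreeing probe sets determine the call, which is the kind of sample-specific behaviour the paragraph above describes.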