Fu Bilin, Xu Jin
Department of Statistics and Actuarial Science, East China Normal University, 500 Dongchuan Road, Shanghai 200241, PR China.
J Bioinform Comput Biol. 2011 Dec;9(6):715-28. doi: 10.1142/s0219720011005458.
Current genotype-calling methods such as the Robust Linear Model with Mahalanobis Distance Classifier (RLMM) and the Corrected Robust Linear Model with Maximum Likelihood Classification (CRLMM) provide accurate calls for Affymetrix single nucleotide polymorphism (SNP) chips. However, these methods are computationally expensive because they rely on preprocessing procedures, including chip data normalization and other sophisticated statistical techniques. Moreover, their accuracy may drop significantly when the sample size is small. We develop a new genotype-calling method for Affymetrix 100K and 500K SNP chips. A two-stage classification scheme is proposed to obtain a fast genotype-calling algorithm. The first stage uses unsupervised classification to quickly discriminate genotypes with high accuracy for the majority of the SNPs. The second stage employs a supervised classification method that incorporates allele frequency information, obtained either from the HapMap data or from a self-training scheme. A confidence score is provided for every genotype call. The overall performance, verified against the known gold-standard HapMap data, is shown to be comparable to that of CRLMM and superior in small-sample cases. The new algorithm is computationally simple and standalone in the sense that a self-training scheme can be used without employing any other training data. A package implementing the calling algorithm is freely available at http://www.sfs.ecnu.edu.cn/teachers/xuj_en.html.
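The two-stage idea can be illustrated with a minimal Python sketch. This is not the authors' algorithm or package: the abstract does not specify the classifiers, so every detail here is an assumption made for illustration, including the allele-contrast feature, the one-dimensional k-means in stage one, the Gaussian likelihood with Hardy-Weinberg priors in stage two, and all names, thresholds, and starting values. Intensities are assumed to be positive.

```python
import numpy as np

GENOTYPES = ("AA", "AB", "BB")

def contrast(a, b):
    """Map allele intensities to a contrast in [-1, 1]; +1 means only allele A."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return (a - b) / (a + b)

def stage1_unsupervised(c, n_iter=20):
    """Stage 1 (assumed): 1-D k-means with three ordered centers (AA, AB, BB)."""
    centers = np.array([0.7, 0.0, -0.7])  # hypothetical starting values
    for _ in range(n_iter):
        labels = np.abs(c[:, None] - centers[None, :]).argmin(axis=1)
        for k in range(3):
            if np.any(labels == k):  # an empty cluster keeps its old center
                centers[k] = c[labels == k].mean()
    dist = np.abs(c[:, None] - centers[None, :])
    labels = dist.argmin(axis=1)
    ordered = np.sort(dist, axis=1)
    # Confidence is 1 on a center and 0 when equidistant from the two nearest.
    conf = 1.0 - ordered[:, 0] / (ordered[:, 1] + 1e-12)
    return labels, conf, centers

def stage2_supervised(c, labels, conf, centers, freq_a=None,
                      threshold=0.5, sd=0.15):
    """Stage 2 (assumed): re-call low-confidence samples with genotype priors."""
    confident = conf >= threshold
    if freq_a is None:
        # Self-training: estimate the allele A frequency from confident calls,
        # with add-one smoothing so the estimate is never exactly 0 or 1.
        n_a = 2 * np.sum(labels[confident] == 0) + np.sum(labels[confident] == 1)
        freq_a = (n_a + 1.0) / (2.0 * confident.sum() + 2.0)
    p = float(np.clip(freq_a, 1e-6, 1.0 - 1e-6))
    prior = np.array([p * p, 2.0 * p * (1.0 - p), (1.0 - p) ** 2])
    log_post = -0.5 * ((c[:, None] - centers[None, :]) / sd) ** 2 + np.log(prior)
    post = np.exp(log_post - log_post.max(axis=1, keepdims=True))
    post /= post.sum(axis=1, keepdims=True)
    new_labels = np.where(confident, labels, post.argmax(axis=1))
    new_conf = np.where(confident, conf, post.max(axis=1))
    return new_labels, new_conf

def call_genotypes(a, b, freq_a=None):
    """Call genotypes for one SNP across samples; returns calls and scores."""
    c = contrast(a, b)
    labels, conf, centers = stage1_unsupervised(c)
    labels, conf = stage2_supervised(c, labels, conf, centers, freq_a)
    return [GENOTYPES[i] for i in labels], conf
```

Passing `freq_a` plays the role of an external allele frequency (for example, one derived from HapMap), while leaving it as `None` falls back to the self-training estimate, mirroring the two options named in the abstract.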