Liu Ruijie, Dai Zhiyin, Yeager Meredith, Irizarry Rafael A, Ritchie Matthew E
Molecular Medicine Division, The Walter and Eliza Hall Institute of Medical Research, 1G Royal Parade, Parkville, Victoria 3052, Australia.
BMC Bioinformatics. 2014 May 23;15:158. doi: 10.1186/1471-2105-15-158.
SNP genotyping microarrays have revolutionized the study of complex disease. The current range of commercially available genotyping products contains extensive catalogues of low-frequency and rare variants. Existing SNP calling algorithms have difficulty dealing with these low-frequency variants, as the underlying models rely on each genotype having a reasonable number of observations to ensure accurate clustering.
Here we develop KRLMM, a new method for converting raw intensities into genotype calls that aims to overcome this issue. Our method is unique in that it applies careful between-sample normalization and allows a variable number of clusters k (1, 2 or 3) for each SNP, where k is predicted using the available data. We compare our method to four genotyping algorithms (GenCall, GenoSNP, Illuminus and OptiCall) on several Illumina data sets that include samples from the HapMap project, for which the true genotypes are known in advance. All methods were found to have high overall accuracy (>98%), with KRLMM consistently amongst the best. At low minor allele frequency, the KRLMM, OptiCall and GenoSNP algorithms were observed to be consistently more accurate than GenCall and Illuminus on our test data.
Methods that tailor their approach to calling low-frequency variants, either by varying the number of clusters (KRLMM) or by using information from other SNPs (OptiCall and GenoSNP), offer improved accuracy over methods that do not (GenCall and Illuminus). The KRLMM algorithm is implemented in the open-source crlmm package distributed via the Bioconductor project (http://www.bioconductor.org).
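The idea of allowing a variable number of clusters k (1, 2 or 3) per SNP can be illustrated with a small Python sketch. This is not the KRLMM implementation: the abstract does not say how k is predicted, so the sketch substitutes a simple one-dimensional k-means fit scored with a BIC-style penalty. The function names (`kmeans_1d`, `choose_k`, `jitter`), the per-sample log-ratio input and the `min_sd` noise floor are all assumptions made for illustration.

```python
import math

def kmeans_1d(values, k, iters=50):
    """Lloyd's algorithm in one dimension; centres start evenly spaced over the data range."""
    lo, hi = min(values), max(values)
    centers = [lo + (i + 0.5) * (hi - lo) / k for i in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for x in values:
            nearest = min(range(k), key=lambda c: abs(x - centers[c]))
            groups[nearest].append(x)
        # An empty cluster keeps its previous centre.
        updated = [sum(g) / len(g) if g else centers[j] for j, g in enumerate(groups)]
        if updated == centers:
            break
        centers = updated
    sse = sum(min((x - c) ** 2 for c in centers) for x in values)
    return centers, sse

def choose_k(values, max_k=3, min_sd=0.1):
    """Pick k in 1..max_k by a penalised within-cluster variance score (illustrative only).

    min_sd is a hypothetical floor on within-cluster spread, so that pure noise
    in a single genotype cluster is not split into extra clusters.
    """
    n = len(values)
    best_k, best_score = 1, float("inf")
    for k in range(1, max_k + 1):
        _, sse = kmeans_1d(values, k)
        variance = max(sse / n, min_sd ** 2)
        score = n * math.log(variance) + 2 * k * math.log(n)  # BIC-style penalty
        if score < best_score:  # strict '<' keeps the smaller k on ties
            best_k, best_score = k, score
    return best_k

def jitter(center, n):
    """Deterministic toy log-ratio values scattered tightly around one genotype cluster."""
    return [center + 0.01 * ((i % 5) - 2) for i in range(n)]

# Toy SNPs: monomorphic, rare heterozygotes only, and all three genotypes present.
monomorphic = jitter(2.0, 20)
rare_variant = jitter(2.0, 18) + jitter(0.0, 2)
common_variant = jitter(2.0, 10) + jitter(0.0, 6) + jitter(-2.0, 4)

print(choose_k(monomorphic), choose_k(rare_variant), choose_k(common_variant))  # prints: 1 2 3
```

On the toy data, the rare-variant SNP is assigned two clusters rather than being forced into a three-cluster model with an unobserved genotype, which is the situation the abstract identifies as difficult for fixed-cluster callers.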