Sabatti Chiara, Lange Kenneth
Departments of Human Genetics and Statistics, University of California, Los Angeles, CA 90095.
J Am Stat Assoc. 2008 Mar 1;103(481):89-100. doi: 10.1198/016214507000000338.
Affymetrix's SNP (single-nucleotide polymorphism) genotyping chips have increased the scope and decreased the cost of gene-mapping studies. Because each SNP is queried by multiple DNA probes, the chips present interesting challenges in genotype calling. Traditional clustering methods distinguish the three genotypes of an SNP fairly well given a large enough sample of unrelated individuals or a training sample of known genotypes. This article describes our attempt to improve genotype calling by constructing Gaussian mixture models with empirically derived priors. The priors stabilize parameter estimation and borrow information collectively gathered on tens of thousands of SNPs. When data from related family members are available, our models capture the correlations in signals between relatives. With these advantages in mind, we apply the models to Affymetrix probe intensity data on 10,000 SNPs gathered on 63 genotyped individuals spread over eight pedigrees. We integrate the genotype-calling model with pedigree analysis and examine a sequence of symmetry hypotheses involving the correlated probe signals. The symmetry hypotheses raise novel mathematical issues of parameterization. Using the Bayesian information criterion, we select the best combination of symmetry assumptions. Compared to Affymetrix's software, our model leads to a reduction in no-calls with little sacrifice in overall calling accuracy.
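The general idea of a Gaussian mixture stabilized by priors and scored with the Bayesian information criterion can be illustrated with a deliberately simplified sketch. It is not the model of the article: it assumes a single one-dimensional signal per individual, unrelated samples, a shared variance across the three genotype clusters, and hypothetical prior cluster centers `prior_means` with a hypothetical prior strength `kappa`. The function names and the no-call threshold are likewise illustrative assumptions.

```python
import numpy as np

def fit_genotype_mixture(x, prior_means, kappa=5.0, n_iter=50):
    """MAP-EM for a toy 1-D, 3-component Gaussian mixture (AA, AB, BB).

    Assumptions (not from the article): shared variance, a Normal prior on each
    mean centered at prior_means with strength kappa (pseudo-observations), and
    a Dirichlet(2, 2, 2) prior on the genotype weights.
    """
    x = np.asarray(x, dtype=float)
    m0 = np.asarray(prior_means, dtype=float)
    n = len(x)
    mu = m0.copy()
    w = np.full(3, 1.0 / 3.0)
    var = np.var(x) / 3.0 + 1e-6
    for _ in range(n_iter):
        # E-step: posterior probability of each genotype for each sample.
        log_p = np.log(w) - 0.5 * np.log(2 * np.pi * var) \
            - 0.5 * (x[:, None] - mu) ** 2 / var
        log_p -= log_p.max(axis=1, keepdims=True)
        r = np.exp(log_p)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: the priors shrink the estimates toward their prior values.
        nk = r.sum(axis=0)
        mu = (r.T @ x + kappa * m0) / (nk + kappa)
        w = (nk + 1.0) / (n + 3.0)
        scatter = (r * (x[:, None] - mu) ** 2).sum()
        var = (scatter + kappa * ((mu - m0) ** 2).sum()) / (n + 3.0) + 1e-6
    return w, mu, var, r

def call_genotypes(r, threshold=0.95):
    """Return 0/1/2 for AA/AB/BB, or -1 (no-call) when the posterior is weak."""
    calls = r.argmax(axis=1)
    calls[r.max(axis=1) < threshold] = -1
    return calls

def bic(x, w, mu, var, n_params=6):
    """BIC = -2 log-likelihood + n_params * log(n); lower is better.

    n_params=6 counts 2 free weights, 3 means, and 1 shared variance.
    """
    x = np.asarray(x, dtype=float)
    dens = w * np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
    loglik = np.log(dens.sum(axis=1)).sum()
    return -2.0 * loglik + n_params * np.log(len(x))
```

In this toy setting, a larger `kappa` pulls the cluster centers more strongly toward the prior centers, which matters most when a genotype cluster contains few or no samples. Competing parameterizations (for example, with and without a symmetry constraint on the means) could be compared by fitting each and keeping the one with the lowest `bic` value, with `n_params` adjusted to the number of free parameters in each.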