Presson Angela P, Sobel Eric, Lange Kenneth, Papp Jeanette C
Department of Statistics, University of California, Los Angeles, Los Angeles, CA 90095, USA.
J Comput Biol. 2006 Jul-Aug;13(6):1131-47. doi: 10.1089/cmb.2006.13.1131.
Genotype calling procedures vary from laboratory to laboratory for many microsatellite markers. Even within the same laboratory, application of different experimental protocols often leads to ambiguities. The impact of these ambiguities ranges from irksome to devastating. Resolving the ambiguities can increase effective sample size and preserve evidence in favor of disease-marker associations. Because different data sets may contain different numbers of alleles, merging is unfortunately not a simple process of matching alleles one to one. Merging data sets manually is difficult, time-consuming, and error-prone due to differences in genotyping hardware, binning methods, molecular weight standards, and curve fitting algorithms. Merging is particularly difficult if few or no samples occur in common, or if samples are drawn from ethnic groups with widely varying allele frequencies. It is dangerous to align alleles simply by adding a constant number of base pairs to the alleles of one of the data sets. To address these issues, we have developed a Bayesian model and a Markov chain Monte Carlo (MCMC) algorithm for sampling the posterior distribution under the model. Our computer program, MicroMerge, implements the algorithm and almost always accurately and efficiently finds the most likely correct alignment. Common allele frequencies across laboratories in the same ethnic group are the single most important cue in the model. MicroMerge computes the allelic alignments with the greatest posterior probabilities under several merging options. It also reports when data sets cannot be confidently merged. These features are emphasized in our analysis of simulated and real data.
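The role of shared allele frequencies as an alignment cue can be illustrated with a small sketch. This is not MicroMerge's model or its MCMC sampler. It is a toy that assumes both laboratories sample the same population, places a symmetric Dirichlet prior on the shared allele frequencies, assumes a uniform prior over a short, hand-picked list of candidate alignments, and enumerates that list exhaustively instead of sampling. The function names, the prior weight `alpha`, the allele counts, and the candidate mappings are all hypothetical.

```python
from math import exp, lgamma


def log_marginal(counts_a, counts_b, universe, alpha=1.0):
    """Log marginal likelihood (up to an alignment-independent constant) of two
    count tables that share one allele-frequency vector, integrated over a
    symmetric Dirichlet(alpha) prior on the alleles in `universe`."""
    k = len(universe)
    n = sum(counts_a.values()) + sum(counts_b.values())
    total = lgamma(k * alpha) - lgamma(k * alpha + n)
    for allele in universe:
        merged = counts_a.get(allele, 0) + counts_b.get(allele, 0)
        total += lgamma(alpha + merged) - lgamma(alpha)
    return total


def alignment_posteriors(counts_a, counts_b, candidates, alpha=1.0):
    """Posterior probability of each candidate alignment under a uniform prior.

    Each candidate maps every lab-B allele label to a distinct merged label
    (assumed one-to-one so the dropped multinomial terms stay constant)."""
    relabeled = {
        name: {mapping[label]: count for label, count in counts_b.items()}
        for name, mapping in candidates.items()
    }
    # One fixed allele universe keeps the prior dimension equal across candidates.
    universe = set(counts_a)
    for table in relabeled.values():
        universe.update(table)
    log_ml = {
        name: log_marginal(counts_a, table, universe, alpha)
        for name, table in relabeled.items()
    }
    top = max(log_ml.values())  # subtract the maximum for numerical stability
    weights = {name: exp(value - top) for name, value in log_ml.items()}
    norm = sum(weights.values())
    return {name: weight / norm for name, weight in weights.items()}


# Hypothetical allele counts (fragment sizes in base pairs) from two laboratories.
lab_a = {100: 5, 102: 40, 104: 30, 106: 10}
lab_b = {101: 6, 103: 38, 105: 29, 107: 12}

# Hypothetical candidate alignments: two constant shifts and one non-constant map.
candidates = {
    "shift -1": {101: 100, 103: 102, 105: 104, 107: 106},
    "shift +1": {101: 102, 103: 104, 105: 106, 107: 108},
    "mixed": {101: 100, 103: 102, 105: 104, 107: 108},
}

posteriors = alignment_posteriors(lab_a, lab_b, candidates)
best = max(posteriors, key=posteriors.get)
print(best, posteriors)
```

With these made-up counts, the "shift -1" candidate receives nearly all of the posterior mass because it is the only one that lines up the two common alleles. When no candidate dominates, a low maximum posterior is the kind of signal that could be used to flag data sets that cannot be confidently merged.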