School of Information and Communication Technology, CQUniversity, Rockhampton 4702, Australia.
BMC Bioinformatics. 2010 Oct 22;11:529. doi: 10.1186/1471-2105-11-529.
The information provided by dense genome-wide markers using high-throughput technology has considerable potential in human disease studies and livestock breeding programs. Genome-wide association studies relate individual single nucleotide polymorphisms (SNP) from dense SNP panels to individual measurements of complex traits, with the underlying assumption being that any association is caused by linkage disequilibrium (LD) between SNP and quantitative trait loci (QTL) affecting the trait. Often, SNP lie in genomic regions that show no trait variation. Whole-genome Bayesian models are an effective way of incorporating this and other important prior information into modelling. However, a full Bayesian analysis is often not feasible because of the long computation time involved.
This article proposes an expectation-maximization (EM) algorithm called emBayesB, which allows only a proportion of SNP to be in LD with QTL and incorporates prior information about the distribution of SNP effects. The posterior probability of being in LD with at least one QTL is calculated for each SNP, along with estimates of the hyperparameters for the mixture prior. A simulated example of genomic selection from an international workshop is used to demonstrate the features of the EM algorithm. The accuracy of prediction is comparable to that of a full Bayesian analysis, but the EM algorithm is considerably faster. The EM algorithm was accurate in locating QTL that explained more than 1% of the total genetic variation. A computational algorithm for very large SNP panels is described.
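The idea of estimating mixture-prior hyperparameters and per-SNP posterior probabilities by EM can be illustrated with a toy sketch. This is not emBayesB itself: it assumes each SNP has an independent marginal effect estimate with known error variance `s2`, and it swaps in a Gaussian "slab" for the non-zero effects purely for simplicity. The function name `em_spike_slab`, the parameter names `gamma` and `tau2`, and the simulated data are all hypothetical and do not come from the article.

```python
import math
import random


def em_spike_slab(b_hat, s2, n_iter=200):
    """Toy EM for a two-component prior on SNP effects (illustrative assumption).

    Each estimate b_hat[j] is modelled as N(0, s2) if the SNP has no effect,
    or N(0, s2 + tau2) if it does, with mixing proportion gamma.
    Returns (gamma, tau2, post), where post[j] is the posterior probability
    that SNP j has a non-zero effect.
    """
    n = len(b_hat)
    gamma = 0.5
    tau2 = max(sum(b * b for b in b_hat) / n, 1e-8)
    post = [gamma] * n
    for _ in range(n_iter):
        # E-step: posterior probability of the non-zero component, in log space.
        v0, v1 = s2, s2 + tau2
        post = []
        for b in b_hat:
            log_slab = math.log(gamma) - 0.5 * math.log(v1) - 0.5 * b * b / v1
            log_spike = math.log(1.0 - gamma) - 0.5 * math.log(v0) - 0.5 * b * b / v0
            post.append(1.0 / (1.0 + math.exp(log_spike - log_slab)))
        # M-step: update the mixing proportion and the slab variance.
        total = sum(post)
        gamma = min(max(total / n, 1e-6), 1.0 - 1e-6)
        second_moment = sum(p * b * b for p, b in zip(post, b_hat)) / max(total, 1e-12)
        tau2 = max(second_moment - s2, 1e-8)
    return gamma, tau2, post


# Simulated data (hypothetical): 2000 SNP, 10% with effects drawn from N(0, 1).
random.seed(0)
s2 = 0.01
b_hat = []
for _ in range(2000):
    effect = random.gauss(0.0, 1.0) if random.random() < 0.1 else 0.0
    b_hat.append(effect + random.gauss(0.0, math.sqrt(s2)))

gamma, tau2, post = em_spike_slab(b_hat, s2)
print(f"estimated proportion with effect: {gamma:.3f}, slab variance: {tau2:.3f}")
```

In this sketch the E-step plays the role of computing each SNP's posterior probability of carrying an effect, and the M-step re-estimates the mixture hyperparameters from those probabilities. A full whole-genome model would fit all SNP jointly rather than treating the estimates as independent.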
emBayesB is a fast and accurate EM algorithm for implementing genomic selection and predicting complex traits by mapping QTL in genome-wide dense SNP marker data. Its accuracy is similar to Bayesian methods but it takes only a fraction of the time.