Weinhold Leonie, Wahl Simone, Pechlivanis Sonali, Hoffmann Per, Schmid Matthias
Department of Medical Biometry, Informatics and Epidemiology, University of Bonn, Sigmund-Freud-Str. 25, Bonn, D-53127, Germany.
Research Unit of Molecular Epidemiology, Helmholtz Zentrum München, Ingolstädter Landstr. 1, Neuherberg, D-85764, Germany.
BMC Bioinformatics. 2016 Nov 22;17(1):480. doi: 10.1186/s12859-016-1347-4.
The analysis of DNA methylation is a key component in the development of personalized treatment approaches. A common way to measure DNA methylation is the calculation of beta values, which are bounded variables of the form M/(M+U) computed from the methylated (M) and unmethylated (U) signal intensities measured by Illumina's 450k BeadChip array. The statistical analysis of beta values is challenging, as traditional methods for the analysis of bounded variables, such as M-value regression and beta regression, are based on regularity assumptions that are often too strong to adequately describe the distribution of beta values.
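As a minimal illustration of the quantities named above, the following sketch computes a beta value from a pair of signal intensities and applies the base-2 logit transform that is commonly used to define M-values. The function names and the example intensities are hypothetical, and the M-value definition is the widely used convention rather than something stated in this abstract. Note that the "M" in "M-value" is unrelated to the methylated intensity M.

```python
import math


def beta_value(m, u):
    """Return the beta value M / (M + U) for non-negative intensities m and u."""
    if m < 0 or u < 0 or m + u == 0:
        raise ValueError("intensities must be non-negative with a positive sum")
    return m / (m + u)


def m_value(beta):
    """Return log2(beta / (1 - beta)), the usual M-value transform (assumed here)."""
    if not 0.0 < beta < 1.0:
        raise ValueError("beta must lie strictly between 0 and 1")
    return math.log2(beta / (1.0 - beta))


# Hypothetical intensities for a single CpG site.
b = beta_value(3000.0, 1000.0)
print(b, m_value(b))
```

The beta value is confined to the unit interval, whereas the M-value is unbounded; this is why the two quantities are typically paired with different regression models.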
We develop a statistical model for the analysis of beta values that is derived from a bivariate gamma distribution for the signal intensities M and U. By allowing for possible correlations between M and U, the proposed model explicitly takes into account the data-generating process underlying the calculation of beta values. Using simulated data and a real sample of DNA methylation data from the Heinz Nixdorf Recall cohort study, we demonstrate that the proposed model fits our data significantly better than beta regression and M-value regression.
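To make the data-generating idea concrete, the sketch below simulates correlated gamma-distributed intensities M and U and forms the ratio M/(M+U). It is an illustration under stated assumptions, not the authors' model: the abstract does not say which bivariate gamma distribution is used, so the code borrows a simple shared-component construction (M = G0 + G1, U = G0 + G2 with independent gamma terms and a common rate). All function and parameter names are hypothetical.

```python
import math
import random


def simulate_beta_values(n, shape_m, shape_u, shape_shared, rate=1.0, seed=0):
    """Simulate n beta values from correlated gamma intensities.

    Assumed construction: M = G0 + G1 and U = G0 + G2, where G0, G1, G2 are
    independent gamma variables with a common rate. The shared term G0
    induces a positive correlation between M and U.
    """
    rng = random.Random(seed)
    scale = 1.0 / rate  # random.gammavariate takes (shape, scale)
    betas = []
    for _ in range(n):
        # A shared shape of zero means no shared component (independent M, U).
        shared = rng.gammavariate(shape_shared, scale) if shape_shared > 0 else 0.0
        m = shared + rng.gammavariate(shape_m, scale)
        u = shared + rng.gammavariate(shape_u, scale)
        betas.append(m / (m + u))
    return betas


def implied_correlation(shape_m, shape_u, shape_shared):
    """Correlation between M and U implied by the shared-component construction."""
    return shape_shared / math.sqrt(
        (shape_shared + shape_m) * (shape_shared + shape_u)
    )


betas = simulate_beta_values(1000, shape_m=2.0, shape_u=5.0, shape_shared=1.0)
print(min(betas), max(betas), implied_correlation(2.0, 5.0, 1.0))
```

Setting `shape_shared` to zero makes M and U independent. With a common rate, the ratio M/(M+U) then follows a beta distribution, which is the special case that ordinary beta regression assumes.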
The proposed model contributes to an improved identification of associations between beta values and covariates such as clinical variables and lifestyle factors in epigenome-wide association studies. It is as easy to apply to a sample of beta values as beta regression and M-value regression.