Whitaker Biomedical Engineering Institute, Johns Hopkins University, Baltimore, MD, USA.
Center for Epigenetics, Johns Hopkins School of Medicine, Baltimore, MD, USA.
BMC Bioinformatics. 2019 Apr 8;20(1):175. doi: 10.1186/s12859-019-2777-6.
Establishment and maintenance of DNA methylation throughout the genome is an important epigenetic mechanism that regulates gene expression, and its disruption has been implicated in human diseases such as cancer. It is therefore crucial to know which genes, or other genomic features of interest, exhibit significant discordance in DNA methylation between two phenotypes. We have previously proposed an approach for ranking genes based on methylation discordance within their promoter regions, determined by centering a window of fixed size at their transcription start sites. However, this method can neither identify statistically significant genomic features nor handle features of variable length or with missing data.
We present a new approach for computing the statistical significance of methylation discordance within genomic features of interest in single and multiple test/reference studies. We base the proposed method on a well-articulated hypothesis testing problem that produces p- and q-values for each genomic feature, which we then use to identify and rank features based on the statistical significance of their epigenetic dysregulation. We employ the information-theoretic concept of mutual information to derive a novel test statistic, which we can evaluate by computing Jensen-Shannon distances between the probability distributions of methylation in a test and a reference sample. We design the proposed methodology to simultaneously handle biological, statistical, and technical variability in the data, as well as variable feature lengths and missing data, thus enabling its widespread use on any list of genomic features. We accomplish this by estimating, from reference data, the null distribution of the test statistic as a function of feature length using generalized additive regression models. Differential assessment, using normal/cancer data from healthy fetal tissue and pediatric high-grade glioma patients, illustrates the potential of our approach to greatly facilitate the exploratory phases of clinically and biologically relevant methylation studies.
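As a rough illustration of two ingredients named above, the following sketch computes a Jensen-Shannon distance between two discrete methylation-level distributions and converts a list of p-values into q-values. It is not the authors' implementation: the function names `jensen_shannon_distance` and `bh_qvalues` are hypothetical, the base-2 logarithm is an assumption, and the Benjamini-Hochberg procedure is assumed for the q-values because the abstract does not say which procedure is used. The mutual-information test statistic itself and the generalized additive model of its null distribution are not reproduced here.

```python
import math


def jensen_shannon_distance(p, q):
    """Jensen-Shannon distance (square root of the JS divergence, base 2).

    p and q are non-negative weights over the same discrete methylation
    levels; they are normalized to sum to one before comparison.
    """
    sp, sq = sum(p), sum(q)
    p = [x / sp for x in p]
    q = [x / sq for x in q]
    # Mixture distribution used as the common reference point.
    m = [0.5 * (a + b) for a, b in zip(p, q)]

    def kl(a, b):
        # Kullback-Leibler divergence in bits; terms with a == 0 contribute 0.
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    jsd = 0.5 * kl(p, m) + 0.5 * kl(q, m)
    return math.sqrt(max(jsd, 0.0))  # guard against tiny negative round-off


def bh_qvalues(pvalues):
    """Benjamini-Hochberg adjusted p-values (assumed q-value procedure)."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    q = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        q[i] = running_min
    return q


if __name__ == "__main__":
    # Hypothetical distributions over four methylation levels in one feature.
    reference = [0.70, 0.20, 0.05, 0.05]
    test = [0.10, 0.20, 0.30, 0.40]
    print(jensen_shannon_distance(reference, test))
    print(bh_qvalues([0.01, 0.04, 0.03, 0.5]))
```

With a base-2 logarithm the distance lies between 0 (identical distributions) and 1 (distributions with disjoint support), which makes values comparable across features.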
The proposed approach provides the first computational tool for statistically testing and ranking genomic features of interest based on observed DNA methylation discordance in comparative studies that accounts, in a rigorous manner, for biological, statistical, and technical variability in methylation data, as well as for variability in feature length and for missing data.