Whitaker Biomedical Engineering Institute, Johns Hopkins University, Baltimore, MD, USA.
Center for Epigenetics, Johns Hopkins School of Medicine, Baltimore, MD, USA.
BMC Bioinformatics. 2018 Mar 7;19(1):87. doi: 10.1186/s12859-018-2086-5.
DNA methylation is a stable form of epigenetic memory used by cells to control gene expression. Whole genome bisulfite sequencing (WGBS) has emerged as a gold-standard experimental technique for studying DNA methylation by producing high-resolution genome-wide methylation profiles. Statistical modeling and analysis are employed to computationally extract and quantify information from these profiles in an effort to identify regions of the genome that demonstrate crucial or aberrant epigenetic behavior. However, the performance of most currently available methods for methylation analysis is hampered by their inability to directly account for statistical dependencies between neighboring methylation sites, and they therefore ignore significant information available in WGBS reads.
We present a powerful information-theoretic approach for genome-wide modeling and analysis of WGBS data based on the 1D Ising model of statistical physics. This approach takes into account correlations in methylation by utilizing a joint probability model that encapsulates all information available in WGBS methylation reads and produces accurate results even when applied to single WGBS samples with low coverage. Using the Shannon entropy, our approach provides a rigorous quantification of methylation stochasticity in individual WGBS samples genome-wide. Furthermore, it utilizes the Jensen-Shannon distance to evaluate differences in methylation distributions between a test and a reference sample. Differential performance assessment using simulated and real human lung normal/cancer data demonstrates a clear superiority of our approach over DSS, a recently proposed method for WGBS data analysis. Critically, these results demonstrate that marginal methods become statistically invalid when correlations are present in the data.
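As an illustration only, the three ingredients named above (a 1D Ising joint distribution over methylation patterns, its Shannon entropy, and the Jensen-Shannon distance between two samples) can be sketched for a handful of CpG sites. This is not the authors' implementation: the parameter names `a` (per-site terms) and `c` (nearest-neighbor coupling terms), the 0/1 to -1/+1 encoding, and the brute-force enumeration of all patterns are assumptions made for a toy example, and enumeration is only feasible for very small regions.

```python
import itertools
import math


def ising_distribution(a, c):
    """Joint distribution over methylation patterns x in {0,1}^N.

    Assumed toy parameterization: P(x) is proportional to
    exp(sum_n a[n]*s_n + sum_n c[n]*s_n*s_{n+1}), with s_n = 2*x_n - 1.
    a has length N (per-site terms), c has length N-1 (neighbor coupling).
    """
    n = len(a)
    weights = {}
    for x in itertools.product((0, 1), repeat=n):
        s = [2 * xi - 1 for xi in x]  # map 0/1 to -1/+1
        u = sum(a[i] * s[i] for i in range(n))
        u += sum(c[i] * s[i] * s[i + 1] for i in range(n - 1))
        weights[x] = math.exp(u)
    z = sum(weights.values())  # partition function
    return {x: w / z for x, w in weights.items()}


def shannon_entropy(p):
    """Shannon entropy in bits of a distribution given as {pattern: prob}."""
    return -sum(v * math.log2(v) for v in p.values() if v > 0)


def jensen_shannon_distance(p, q):
    """Square root of the Jensen-Shannon divergence (base 2, so in [0, 1])."""
    keys = set(p) | set(q)
    m = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}

    def kl_to_mixture(r):
        return sum(v * math.log2(v / m[k]) for k, v in r.items() if v > 0)

    jsd = 0.5 * kl_to_mixture(p) + 0.5 * kl_to_mixture(q)
    return math.sqrt(max(jsd, 0.0))  # guard against tiny negative round-off


def site_marginals(p):
    """Per-site methylation probabilities P(x_n = 1) from the joint."""
    n = len(next(iter(p)))
    return [sum(v for x, v in p.items() if x[n_i] == 1) for n_i in range(n)]


# Hypothetical 3-site region: "reference" has independent sites,
# "test" has the same marginals but strongly coupled neighbors.
reference = ising_distribution(a=[0.0, 0.0, 0.0], c=[0.0, 0.0])
test = ising_distribution(a=[0.0, 0.0, 0.0], c=[1.5, 1.5])

print("marginals (reference):", site_marginals(reference))
print("marginals (test):     ", site_marginals(test))
print("entropy (reference):  ", shannon_entropy(reference))
print("entropy (test):       ", shannon_entropy(test))
print("JS distance:          ", jensen_shannon_distance(reference, test))
```

In this toy setting both samples have identical per-site methylation levels, so a purely marginal comparison would see no difference, while the joint distributions differ in both entropy and Jensen-Shannon distance. That is the kind of information the abstract says is lost when dependencies between neighboring sites are ignored.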
This contribution demonstrates the clear benefits, and the necessity, of modeling joint probability distributions of methylation using the 1D Ising model of statistical physics and of quantifying methylation stochasticity using concepts from information theory. By employing this methodology, DNA methylation analysis can be substantially improved by effectively taking into account the massive amount of statistical information available in WGBS data, which is largely ignored by existing methods.