Institut für Mathematik und Informatik, Universität Greifswald, Walther-Rathenau-Str. 47, Greifswald, 17487, Germany.
Theoretical Biology and Biophysics, Group T-6, Los Alamos National Laboratory, Los Alamos, New Mexico, USA.
BMC Bioinformatics. 2018 Mar 27;19(1):105. doi: 10.1186/s12859-018-2115-4.
DNA methylation patterns store epigenetic information in the vast majority of eukaryotic species. However, the relatively high costs and technical challenges associated with detecting DNA methylation have biased the number of methylation studies towards model organisms. Consequently, it remains challenging to infer kingdom-wide general rules about the functions and evolutionary conservation of DNA methylation. Methylated cytosine is often found in specific CpN dinucleotides, and because methylated CpG is more mutable, the frequency distributions of ratios such as CpG observed/expected (CpG o/e) have been used to infer DNA methylation types.
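As an illustration of the ratio mentioned above, the following minimal sketch computes a CpG o/e value for a single sequence. It assumes one commonly used normalisation, (number of CpG × sequence length) / (number of C × number of G); the exact formula used by Notos may differ, and the function name `cpg_oe` is hypothetical.

```python
def cpg_oe(seq: str) -> float:
    """Observed/expected CpG ratio for one sequence.

    Assumption: o/e = (n_CpG * L) / (n_C * n_G). Other normalisations exist.
    """
    s = seq.upper()
    length = len(s)
    n_c = s.count("C")
    n_g = s.count("G")
    # "CG" cannot overlap with itself, so str.count gives the dinucleotide count.
    n_cg = s.count("CG")
    if n_c == 0 or n_g == 0:
        # The expected count is zero, so the ratio is undefined.
        return float("nan")
    return n_cg * length / (n_c * n_g)
```

Under this definition, values well below 1 indicate CpG depletion, which is what the higher mutability of methylated CpG would be expected to produce.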
Questions about the number and position of modes of CpG o/e ratios are currently investigated with predominantly model-based approaches, essentially founded on mixtures of Gaussian distributions. These approaches require the selection of an appropriate criterion for determining the best model and will fail if empirical distributions are complex or even merely moderately skewed. We use a kernel density estimation (KDE) based technique for robust and precise characterization of complex CpN o/e distributions without a priori assumptions about the underlying distributions.
We show that KDE delivers robust descriptions of CpN o/e distributions. For straightforward processing, we have developed a Galaxy tool, called Notos and available at the ToolShed, that calculates these ratios for input FASTA files and fits a density to their empirical distribution. Based on the estimated density, the number and shape of the modes of the distribution are determined, providing a rationale for predicting the number and types of different methylation classes. Notos is written in R and Perl.
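The density-fitting and mode-detection steps described above can be sketched as follows. This is not the Notos implementation (which is written in R and Perl); it is a minimal illustration that assumes a Gaussian kernel, a normal-reference rule-of-thumb bandwidth, and a simple grid search for local maxima. The names `gaussian_kde` and `find_modes` are hypothetical.

```python
import math
import statistics


def gaussian_kde(values, bandwidth=None):
    """Return a function that evaluates a Gaussian KDE of `values`."""
    n = len(values)
    if bandwidth is None:
        # Assumption: normal-reference rule of thumb, h = 1.06 * sd * n^(-1/5).
        bandwidth = 1.06 * statistics.stdev(values) * n ** (-1 / 5)
    norm = 1.0 / (n * bandwidth * math.sqrt(2 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - v) / bandwidth) ** 2) for v in values
        )

    return density


def find_modes(values, grid_size=512, bandwidth=None):
    """Locate modes as local maxima of the KDE on an evenly spaced grid."""
    density = gaussian_kde(values, bandwidth)
    lo, hi = min(values), max(values)
    step = (hi - lo) / (grid_size - 1)
    xs = [lo + i * step for i in range(grid_size)]
    ys = [density(x) for x in xs]
    # A grid point is a mode if the density rises into it and does not
    # rise after it; ">=" avoids missing a peak shared by two grid points.
    return [
        xs[i]
        for i in range(1, grid_size - 1)
        if ys[i - 1] < ys[i] >= ys[i + 1]
    ]
```

Applied to a list of CpN o/e ratios, `find_modes` returns the approximate positions of the modes; a bimodal result would be consistent with two methylation classes. The number of detected modes depends on the bandwidth, so the rule-of-thumb choice here should be treated as a starting point only.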