Hardcastle Thomas J
Department of Plant Sciences, University of Cambridge, Downing Street, Cambridge, CB2 3EA, UK.
BMC Bioinformatics. 2017 Sep 18;18(1):416. doi: 10.1186/s12859-017-1836-0.
Cytosine methylation is widespread in most eukaryotic genomes and is known to play a substantial role in various regulatory pathways. Unmethylated cytosines may be converted to uracil through the addition of sodium bisulphite, allowing genome-wide quantification of cytosine methylation via high-throughput sequencing. The data thus acquired allow the discovery of methylation 'loci': contiguous regions that are consistently methylated across biological replicates. Mapping these loci allows associations with other genomic factors to be identified and analyses of differential methylation to be carried out.
The segmentSeq R package is extended to identify methylation loci from high-throughput sequencing data from multiple experimental conditions. A statistical model is then developed that accounts for biological replication and variable rates of non-conversion of cytosines in each sample to compute posterior likelihoods of methylation at each locus within an empirical Bayesian framework. The same model is used as a basis for analysis of differential methylation between multiple experimental conditions with the baySeq R package. We demonstrate the capability of this method to analyse complex data sets in an analysis of data derived from multiple Dicer-like mutants in Arabidopsis. This reveals several novel behaviours at distinct sets of loci in response to loss of one or more of the Dicer-like proteins, indicating an antagonistic relationship between the Dicer-like proteins at least at some methylation loci. Finally, we show in simulation studies that this approach can be significantly more powerful in the detection of differential methylation than many existing methods in data derived from both mammalian and plant systems.
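As a rough illustration of why the non-conversion rate matters when scoring a locus, the following Python sketch compares two hypotheses for replicated bisulphite counts. It is a toy model written for this summary, not the model implemented in segmentSeq or baySeq: the function names, the uniform prior on the methylation proportion, the 0.5 prior probability of methylation, and the independent treatment of each replicate are all assumptions.

```python
from math import comb


def binom_pmf(k, n, q):
    """Probability of k unconverted reads out of n when each read is unconverted with probability q."""
    return comb(n, k) * q**k * (1 - q) ** (n - k)


def marginal_methylated(k, n, p, grid=200):
    """Likelihood of the counts if the locus is methylated (assumed toy model).

    The unknown methylation proportion theta gets a uniform prior and is
    integrated out with a midpoint rule. A read appears unconverted either
    because the cytosine is methylated or because an unmethylated cytosine
    failed to convert (rate p).
    """
    total = 0.0
    for i in range(grid):
        theta = (i + 0.5) / grid
        q = theta + (1 - theta) * p
        total += binom_pmf(k, n, q)
    return total / grid


def posterior_methylated(replicates, prior=0.5, grid=200):
    """Posterior probability of methylation given replicates of (k, n, p).

    k: unconverted (apparently methylated) reads, n: total reads,
    p: that sample's non-conversion rate. Replicates are treated as
    independent, which is a simplifying assumption.
    """
    like_unmeth = 1.0
    like_meth = 1.0
    for k, n, p in replicates:
        # Unmethylated locus: unconverted reads arise only from non-conversion.
        like_unmeth *= binom_pmf(k, n, p)
        like_meth *= marginal_methylated(k, n, p, grid)
    numerator = prior * like_meth
    return numerator / (numerator + (1 - prior) * like_unmeth)


# Hypothetical counts: two replicates with different non-conversion rates.
strong = posterior_methylated([(18, 20, 0.01), (15, 20, 0.02)])
absent = posterior_methylated([(0, 20, 0.01), (0, 25, 0.02)])
```

In this sketch a locus with many unconverted reads in both replicates receives a posterior close to one, while a locus with no unconverted reads receives a posterior close to zero. A sample with a higher non-conversion rate contributes weaker evidence for methylation from the same counts.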
The methods developed here make it possible to analyse high-throughput sequencing of the methylome of any given organism under a diverse set of experimental conditions. The methods are able to identify methylation loci and evaluate the likelihood that a region is truly methylated under any given experimental condition, allowing for downstream analyses that characterise differences between methylated and non-methylated regions of the genome. Furthermore, diverse patterns of differential methylation may also be characterised from these data.