Division of Biomedical Statistics and Informatics, Mayo Clinic, Rochester, MN.
Department of Biostatistics and Computational Biology, University of Rochester, Rochester, NY.
Brief Bioinform. 2018 May 1;19(3):374-386. doi: 10.1093/bib/bbw133.
High-throughput bisulfite methylation sequencing, such as reduced representation bisulfite sequencing (RRBS), Agilent SureSelect Human Methyl-Seq (Methyl-seq) or whole-genome bisulfite sequencing, is commonly used for base-resolution methylome research. These data are represented either as the ratio of methylated cytosines to total coverage at a CpG site or as the numbers of methylated and unmethylated cytosines. Multiple statistical methods can be used to detect differentially methylated CpGs (DMCs) between conditions, and these methods are often the basis for the next step, the identification of differentially methylated regions. Ratio data can be fitted flexibly with many linear models, whereas raw count data retain coverage information. Each data type offers an array of options for DMC detection; however, it is not clear which statistical method is optimal. In this study, we systematically evaluated four statistical methods on methylation ratio data and four methods on count-based data, and compared their performance with regard to type I error control, sensitivity and specificity of DMC detection, and computational resource demands, using real RRBS data along with simulation. Our results show that the ratio-based tests are generally more conservative (less sensitive) than the count-based tests. However, some count-based methods have high false-positive rates and should be avoided. The beta-binomial model gives a good balance between sensitivity and specificity and is the preferred method. Selection of methods in different settings, signal versus noise and sample size estimation are also discussed.
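The difference between the two data representations described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The function names, the example read counts, and the mean/overdispersion parameterization (`mu`, `rho`) of the beta-binomial are assumptions chosen for illustration. The sketch shows that two CpG sites with the same methylation ratio can differ greatly in coverage, and how a beta-binomial log-probability uses the raw counts directly.

```python
from math import exp, lgamma, sqrt


def methylation_ratio(meth, unmeth):
    """Ratio representation: methylated reads divided by total coverage."""
    return meth / (meth + unmeth)


def binomial_se(meth, unmeth):
    """Binomial standard error of the ratio; it shrinks as coverage grows."""
    n = meth + unmeth
    p = meth / n
    return sqrt(p * (1.0 - p) / n)


def log_beta(a, b):
    """Natural log of the beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def betabinom_logpmf(k, n, mu, rho):
    """Log-probability of k methylated reads out of n under a beta-binomial.

    Assumed parameterization (hypothetical, for illustration only):
    mu is the mean methylation level and rho (0 < rho < 1) is the
    overdispersion, so that alpha + beta = (1 - rho) / rho.
    """
    s = (1.0 - rho) / rho
    a, b = mu * s, (1.0 - mu) * s
    log_choose = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
    return log_choose + log_beta(k + a, n - k + b) - log_beta(a, b)


# Two hypothetical CpG sites: same ratio, tenfold difference in coverage.
shallow = (5, 5)    # (methylated, unmethylated) read counts
deep = (50, 50)

for meth, unmeth in (shallow, deep):
    print(meth + unmeth, methylation_ratio(meth, unmeth), binomial_se(meth, unmeth))

# Sanity check: the beta-binomial probabilities over k = 0..n sum to 1.
print(sum(exp(betabinom_logpmf(k, 10, 0.3, 0.1)) for k in range(11)))
```

Both sites yield the same ratio, so a ratio-based test treats them identically, whereas a count-based model such as the beta-binomial also sees the coverage behind each ratio.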