College of Medicine and Health, University of Exeter, Royal Devon and Exeter Hospital, Exeter, EX2 5DW, UK.
Department of Pathology, Beth Israel Deaconess Medical Center, 330 Brookline Avenue, Boston, Massachusetts, USA.
BMC Genomics. 2021 Jun 15;22(1):446. doi: 10.1186/s12864-021-07721-z.
The combination of sodium bisulfite treatment with highly parallel sequencing is a common method for quantifying DNA methylation across the genome. The power to detect between-group differences in DNA methylation using bisulfite sequencing approaches is influenced by both experimental (e.g. read depth, missing data and sample size) and biological (e.g. mean level of DNA methylation and difference between groups) parameters. There is, however, no consensus about the optimal thresholds for filtering bisulfite sequencing data, which has implications for the reproducibility of findings in epigenetic epidemiology.
We used a large reduced representation bisulfite sequencing (RRBS) dataset to assess the distribution of read depth across DNA methylation sites and the extent of missing data. To investigate how study variables influence the power to identify DNA methylation differences between groups, we developed a framework for simulating bisulfite sequencing data. As expected, sequencing read depth, group size, and the magnitude of the DNA methylation difference between groups all affected statistical power. Power did not depend on any single parameter, but reflected the combination of study-specific variables. As a resource to the community, we have developed a tool, POWEREDBiSeq, which uses our simulation framework to predict study-specific power for identifying DNA methylation differences between groups, taking into account user-defined read depth filtering parameters and the minimum sample size per group.
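The dependence of power on read depth, group size and effect size can be illustrated with a small simulation. The sketch below is not the POWEREDBiSeq implementation: the function names, the Gaussian between-sample spread (`sd`), the binomial read model and the normal approximation to Welch's test are all assumptions chosen to keep the example short and dependency-free.

```python
import random
from statistics import NormalDist, mean, variance


def simulate_group(n_samples, mean_meth, read_depth, sd, rng):
    """Simulate observed methylation proportions for one group at one site."""
    observed = []
    for _ in range(n_samples):
        # Assumed model: each sample's true level is Gaussian around the
        # group mean, clipped to the valid range [0, 1].
        p = min(1.0, max(0.0, rng.gauss(mean_meth, sd)))
        # Each read is methylated with probability p (binomial sampling).
        methylated = sum(rng.random() < p for _ in range(read_depth))
        observed.append(methylated / read_depth)
    return observed


def welch_p_value(a, b):
    """Two-sided p-value for Welch's statistic (normal approximation,
    which is anti-conservative for very small groups)."""
    se2 = variance(a) / len(a) + variance(b) / len(b)
    if se2 == 0:
        return 0.0 if mean(a) != mean(b) else 1.0
    z = (mean(a) - mean(b)) / se2 ** 0.5
    return 2 * (1 - NormalDist().cdf(abs(z)))


def estimate_power(n_per_group, read_depth, mean_meth, delta,
                   sd=0.05, alpha=0.05, n_sims=500, seed=1):
    """Fraction of simulated studies in which the group difference is detected."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        g1 = simulate_group(n_per_group, mean_meth, read_depth, sd, rng)
        g2 = simulate_group(n_per_group, mean_meth + delta, read_depth, sd, rng)
        if welch_p_value(g1, g2) < alpha:
            hits += 1
    return hits / n_sims


if __name__ == "__main__":
    # Compare a deeply sequenced, larger study with a shallow, small one.
    print(estimate_power(n_per_group=20, read_depth=50, mean_meth=0.5, delta=0.10))
    print(estimate_power(n_per_group=5, read_depth=5, mean_meth=0.5, delta=0.02))
```

Varying one argument at a time while holding the others fixed shows why no single parameter determines power: a shallow read depth can be offset by a larger group size or a larger expected difference, and vice versa.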
Our data-driven approach highlights the importance of filtering bisulfite sequencing data by minimum read depth and illustrates how the choice of threshold is influenced by the specific study design and the expected differences between the groups being compared. The POWEREDBiSeq tool, which can be applied to different types of bisulfite sequencing data (e.g. RRBS, whole genome bisulfite sequencing (WGBS), targeted bisulfite sequencing and amplicon-based bisulfite sequencing), can help users identify the level of data filtering needed to optimize power, with the aim of improving the reproducibility of bisulfite sequencing studies.
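Filtering by minimum read depth and minimum sample size per group can be sketched as follows. This is an illustrative example only: the function name, the data layout (a dictionary mapping each site to per-sample read depths, with `None` for missing data) and the default thresholds are assumptions, not part of POWEREDBiSeq, and the site names and depths are made-up toy values.

```python
def filter_sites(depth_by_site, group_labels, min_depth=10, min_per_group=3):
    """Keep sites where every group has enough samples at sufficient depth."""
    groups = set(group_labels)
    kept = []
    for site, depths in depth_by_site.items():
        # Count, per group, the samples that pass the read depth threshold.
        counts = {}
        for depth, group in zip(depths, group_labels):
            if depth is not None and depth >= min_depth:
                counts[group] = counts.get(group, 0) + 1
        if all(counts.get(g, 0) >= min_per_group for g in groups):
            kept.append(site)
    return kept


# Toy data: four samples (two per group) at three hypothetical sites.
group_labels = ["case", "case", "control", "control"]
depth_by_site = {
    "chr1:100": [12, 30, 15, 11],
    "chr1:200": [12, 3, 15, 11],     # one case sample below threshold
    "chr1:300": [None, 20, 25, 40],  # one case sample missing
}

retained = filter_sites(depth_by_site, group_labels, min_depth=10, min_per_group=2)
print(retained)
```

Raising `min_depth` improves the precision of each retained measurement but removes more sites and samples, which is the trade-off the power estimates are meant to inform.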