Center for Medical Genetics & Hunan Key Laboratory of Medical Genetics, School of Life Sciences, Central South University, Changsha, Hunan 410038, China.
Institute of Molecular Precision Medicine, Xiangya Hospital, Key Laboratory of Molecular Precision Medicine of Hunan Province, Central South University, Changsha, Hunan 410038, China.
Brief Bioinform. 2022 Jan 17;23(1). doi: 10.1093/bib/bbab475.
Investigating differentially methylated regions (DMRs) that are present in different tissues or cell types can help reveal the mechanisms behind tissue-specific gene expression. Identified tissue- or disease-specific DMRs can also serve as feature markers for determining the tissues of origin of cell-free DNA (cfDNA) in noninvasive diagnosis. In recent years, many methods have been proposed to detect DMRs. However, because benchmark DMRs are lacking, it is difficult for researchers to choose an appropriate method and to select a suitable DMR set for downstream studies. When a threshold is set on the methylation difference, DMRs used as feature markers are more useful if they are longer and contain more CpG sites. Based on this observation, this paper proposes two metrics ($Qn$ and $Ql$), in which the CpG numbers and lengths of DMRs with different methylation differences are weighted differently, to evaluate the DMR sets predicted by different methods on BS-seq data. DMR sets predicted by eight methods on both simulated and real BS-seq datasets are evaluated with the proposed metrics, with benchmark-based metrics, and with enrichment analysis of biological data, including genomic features, transcription factors and histones. Rank correlation analysis shows that $Qn$ and $Ql$ are highly correlated with the benchmark metrics on simulated datasets and with the biological data enrichment analysis on real BS-seq data. Therefore, without requiring additional biological data, the proposed metrics can help researchers select a more suitable DMR set for a given BS-seq dataset.
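The abstract does not give the formulas for $Qn$ and $Ql$, so the following Python sketch only illustrates the general idea and is not the paper's definition. It assumes a hypothetical weight that is zero below a methylation-difference threshold and linear above it, sums weighted CpG counts (`toy_qn`) and weighted lengths (`toy_ql`) over a DMR set, and then compares the resulting ranking of DMR sets with a benchmark score using Spearman's rank correlation. The choice of Spearman's coefficient, all names, the threshold, and the example numbers are assumptions made purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class DMR:
    length_bp: int    # genomic length of the region in base pairs
    n_cpg: int        # number of CpG sites inside the region
    meth_diff: float  # absolute mean methylation difference, in [0, 1]


def toy_weight(meth_diff, threshold=0.1):
    """Hypothetical weight (an assumption): 0 below the threshold, else the difference."""
    return meth_diff if meth_diff >= threshold else 0.0


def toy_qn(dmrs, threshold=0.1):
    """Toy CpG-count score: sum of weight * number of CpG sites."""
    return sum(toy_weight(d.meth_diff, threshold) * d.n_cpg for d in dmrs)


def toy_ql(dmrs, threshold=0.1):
    """Toy length score: sum of weight * region length."""
    return sum(toy_weight(d.meth_diff, threshold) * d.length_bp for d in dmrs)


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman's rank correlation, computed as the Pearson correlation of the ranks."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5


# Made-up DMR sets standing in for the output of three hypothetical methods.
set_a = [DMR(500, 20, 0.4), DMR(200, 8, 0.05)]
set_b = [DMR(100, 4, 0.3)]
set_c = [DMR(800, 30, 0.5), DMR(300, 10, 0.2)]
dmr_sets = [set_a, set_b, set_c]

# Made-up benchmark scores (for example, an F1 score on simulated data).
benchmark = [0.6, 0.3, 0.8]

qn_scores = [toy_qn(s) for s in dmr_sets]
ql_scores = [toy_ql(s) for s in dmr_sets]
print("rank correlation (toy Qn vs benchmark):", spearman(qn_scores, benchmark))
print("rank correlation (toy Ql vs benchmark):", spearman(ql_scores, benchmark))
```

A high rank correlation in such a comparison would suggest that the score orders the DMR sets the same way the benchmark does, which is the kind of agreement the abstract reports for $Qn$ and $Ql$.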