Department of Computer Science and Engineering, University of Bologna, Mura Anteo Zamboni 7, Bologna, Italy.
Department of Physics and Astronomy, University of Bologna, Viale Berti Pichat 6/2, Bologna, Italy.
Bioinformatics. 2019 Oct 1;35(19):3786-3793. doi: 10.1093/bioinformatics/btz134.
DNA methylation is a stable epigenetic mark with major implications for both physiological (development, aging) and pathological conditions (cancers and numerous other diseases). Recent methylation research focuses on developing molecular age estimation methods based on DNA methylation levels (mAge). A growing number of studies indicate that divergence between mAge and chronological age may be associated with age-related diseases. Advances in high-throughput technologies have made it possible to characterize DNA methylation levels throughout the human genome. However, experimental methylation profiles often contain multiple missing values, which can affect both data analysis and mAge estimation. Although several imputation methods exist, a major shortcoming is their inability to cope with large datasets such as DNA methylation chips. Specific methods for imputing missing methylation data are therefore needed.
We present a simple and computationally efficient imputation method, methyLImp, based on linear regression. The rationale of the approach is the observation that methylation levels show a high degree of inter-sample correlation. We compared our approach with other imputation methods on DNA methylation data from healthy and disease samples across different tissues. Performance was assessed both in terms of imputation accuracy and in terms of the impact of imputed values on mAge estimation. Compared with existing methods, our linear regression model performs as well or better, with good computational efficiency. The results of our analysis provide recommendations for accurate estimation of missing methylation values.
The R-package methyLImp is freely available at https://github.com/pdilena/methyLImp.
Supplementary data are available at Bioinformatics online.
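To make the regression idea in the abstract concrete, the following is a minimal sketch of regression-based imputation on a small table of methylation beta values. It is not the methyLImp algorithm or its API: the function names (`pearson`, `fit_line`, `impute_linear`), the layout (rows as samples, columns as CpG sites, `None` for a missing value), and the choice of a single best-correlated predictor column are all assumptions made for illustration.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def fit_line(x, y):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    if sxx == 0:
        return 0.0, my  # constant predictor: fall back to the mean of y
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    return slope, my - slope * mx


def impute_linear(matrix):
    """Fill None entries column by column (hypothetical helper).

    For each column with gaps, the fully observed column that correlates
    most strongly with it (on the rows where it is observed) is used as
    the single predictor in a simple linear regression.
    """
    n_rows, n_cols = len(matrix), len(matrix[0])
    complete = [j for j in range(n_cols)
                if all(row[j] is not None for row in matrix)]
    result = [list(row) for row in matrix]  # leave the input untouched
    for j in range(n_cols):
        missing = [i for i in range(n_rows) if matrix[i][j] is None]
        observed = [i for i in range(n_rows) if matrix[i][j] is not None]
        if not missing or not complete or len(observed) < 2:
            continue
        y = [matrix[i][j] for i in observed]
        best = max(
            complete,
            key=lambda k: abs(pearson([matrix[i][k] for i in observed], y)),
        )
        slope, intercept = fit_line([matrix[i][best] for i in observed], y)
        for i in missing:
            predicted = slope * matrix[i][best] + intercept
            # Beta values are proportions, so keep predictions in [0, 1].
            result[i][j] = min(1.0, max(0.0, predicted))
    return result


# Toy data (made up): 4 samples x 3 CpG sites, one missing value.
betas = [
    [0.10, 0.20, 0.80],
    [0.20, 0.40, 0.50],
    [0.30, None, 0.60],
    [0.40, 0.80, 0.70],
]
imputed = impute_linear(betas)
print(round(imputed[2][1], 3))
```

In this toy table the second site is exactly twice the first in every observed sample, so the first site is selected as predictor and the gap is filled with a value close to 0.6. A multivariate regression over many observed sites would be the natural extension; the single-predictor form is used here only to keep the sketch short.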