Fast matrix completion in epigenetic methylation studies with informative covariates.

Author information

Department of Decision Science, HEC Montreal, 3000 chemin de la Cote Ste Catherine, Montréal, QC H3T 2A7, Canada.

Department of Mathematics, Université du Québec à Montreal, 201 Ave Président-Kennedy, Montreal, QC H2X 3Y7, Canada.

Publication information

Biostatistics. 2024 Oct 1;25(4):1062-1078. doi: 10.1093/biostatistics/kxae016.

DOI:10.1093/biostatistics/kxae016

PMID:38850151

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC11471954/

Abstract

DNA methylation is an important epigenetic mark that modulates gene expression by inhibiting the binding of transcriptional proteins to DNA. As in many other omics experiments, missing values are a major issue, and appropriate imputation techniques are needed both to avoid an unnecessary reduction in sample size and to make full use of the information collected. We consider the case where relatively few samples are processed via an expensive high-density whole-genome bisulfite sequencing (WGBS) strategy and a larger number of samples are processed using more affordable low-density, array-based technologies. In such cases, one can impute the low-coverage (array-based) methylation data using the high-density information provided by the WGBS samples. In this paper, we propose an efficient Linear Model of Coregionalisation with informative Covariates (LMCC) to predict missing values based on observed values and covariates. Our model assumes that, at each site, the methylation vector of all samples is linked to a set of fixed factors (covariates) and a set of latent factors. Furthermore, we exploit the functional nature of the data and the spatial correlation across sites by assuming Gaussian processes on the fixed and latent coefficient vectors, respectively. Our simulations show that the use of covariates can significantly improve the accuracy of imputed values, especially when the missing data contain relevant information about the explanatory variable. We also show that the proposed model is particularly efficient when the number of columns is much greater than the number of rows, which is usually the case in methylation data analysis. Finally, we apply our method to two real methylation datasets and compare it with alternative approaches, showing how covariates such as cell type, tissue type, or age can enhance the accuracy of imputed values.
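
To make the model structure described in the abstract concrete, the following is a minimal simulation sketch, not the authors' implementation. It generates a samples-by-sites matrix in which each site's methylation vector is a covariate term plus a latent-factor term plus noise, with the coefficient vectors varying smoothly across sites as Gaussian process draws. The notation (`X`, `B`, `A`, `U`), the squared-exponential kernel, the length scales, the noise level, and the missing-completely-at-random mask are all illustrative assumptions, none taken from the paper. Values are simulated on an unbounded scale (for example, logit-transformed methylation), which is also an assumption.

```python
import numpy as np


def se_kernel(pos, length_scale):
    """Squared-exponential covariance between site positions (assumed kernel)."""
    d = pos[:, None] - pos[None, :]
    return np.exp(-0.5 * (d / length_scale) ** 2)


def draw_gp(rng, K, n_draws, jitter=1e-6):
    """Draw n_draws independent zero-mean GP paths with covariance K.

    Returns an array of shape (n_draws, n_sites).
    """
    L = np.linalg.cholesky(K + jitter * np.eye(K.shape[0]))
    return (L @ rng.standard_normal((K.shape[0], n_draws))).T


def simulate(n=20, p=200, r=3, noise_sd=0.1, miss_rate=0.3, seed=0):
    """Simulate Y = X B + A U + noise, then hide a random subset of entries.

    n: samples (rows), p: sites (columns), r: latent factors.
    All parameter values here are hypothetical defaults.
    """
    rng = np.random.default_rng(seed)
    pos = np.sort(rng.uniform(0.0, 1.0, size=p))      # site positions

    # Covariates: an intercept and one binary covariate (e.g. a cell-type flag).
    X = np.column_stack([np.ones(n), rng.integers(0, 2, size=n)])
    q = X.shape[1]

    B = draw_gp(rng, se_kernel(pos, 0.20), q)          # fixed coefficients, q x p
    U = draw_gp(rng, se_kernel(pos, 0.05), r)          # latent coefficients, r x p
    A = rng.standard_normal((n, r))                    # sample-level loadings

    Y = X @ B + A @ U + noise_sd * rng.standard_normal((n, p))

    mask = rng.random((n, p)) < miss_rate              # True where value is hidden
    Y_obs = np.where(mask, np.nan, Y)
    return pos, X, Y, Y_obs, mask


if __name__ == "__main__":
    pos, X, Y, Y_obs, mask = simulate()
    print("matrix shape (samples x sites):", Y_obs.shape)
    print("fraction missing:", round(float(mask.mean()), 3))
```

The sketch uses far more sites than samples (`p` much larger than `n`), matching the wide-matrix setting the abstract identifies as typical of methylation data. An imputation method can then be scored by comparing its predictions with `Y[mask]`.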
