Maciejewski Emily, Horvath Steve, Ernst Jason
Computer Science Department, University of California, Los Angeles, Los Angeles, CA 90095, USA.
Department of Biological Chemistry, University of California, Los Angeles, Los Angeles, CA 90095, USA.
bioRxiv. 2023 Nov 27:2023.11.26.568769. doi: 10.1101/2023.11.26.568769.
DNA methylation data offers valuable insights into various aspects of mammalian biology. The recent introduction and large-scale application of the mammalian methylation array has significantly expanded the availability of such data across conserved sites in many mammalian species. In our study, we consider 13,245 samples profiled on this array encompassing 348 species and 59 tissues from 746 species-tissue combinations. While having some coverage of many different species and tissue types, this data captures only 3.6% of potential species-tissue combinations. To address this gap, we developed CMImpute (Cross-species Methylation Imputation), a method based on a Conditional Variational Autoencoder, to impute DNA methylation for non-profiled species-tissue combinations. In cross-validation, we demonstrate that CMImpute achieves a strong correlation with actual observed values, surpassing several baseline methods. Using CMImpute we imputed methylation data for 19,786 new species-tissue combinations. We believe that both CMImpute and our imputed data resource will be useful for DNA methylation analyses across a wide range of mammalian species.
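The coverage figures quoted in the abstract can be checked with a short calculation. The sketch below assumes that "potential species-tissue combinations" means the full cross product of the 348 species and 59 tissues; the abstract does not state this explicitly, but the assumption reproduces both the 3.6% coverage and the 19,786 imputed combinations. The variable names are illustrative only.

```python
# Counts taken from the abstract.
n_species = 348
n_tissues = 59
n_profiled = 746  # species-tissue combinations with observed data

# Assumption: every species could in principle be paired with every tissue.
n_possible = n_species * n_tissues
n_missing = n_possible - n_profiled
coverage_pct = 100 * n_profiled / n_possible

print(f"possible combinations: {n_possible}")
print(f"profiled coverage: {coverage_pct:.1f}%")
print(f"non-profiled combinations: {n_missing}")
```

Under this assumption, the number of non-profiled combinations equals the number of combinations the authors report imputing.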