Bioinformatics, Department of Physics, Chemistry and Biology, Linköping University, Sweden.
Department of Mathematical Sciences, Chalmers University of Technology, Sweden.
Brief Bioinform. 2023 Sep 20;24(5). doi: 10.1093/bib/bbad293.
Precision medicine relies on the identification of robust disease and risk factor signatures from omics data. However, current knowledge-driven approaches may overlook novel or unexpected phenomena due to the inherent biases in biological knowledge. In this study, we present a data-driven signature discovery workflow for DNA methylation analysis utilizing network-coherent autoencoders (NCAEs) with biologically relevant latent embeddings. First, we explored the architecture space of autoencoders trained on a large-scale pan-tissue compendium (n = 75 272) of human epigenome-wide association studies. We observed the emergence of co-localized patterns in the deep autoencoder latent space representations that corresponded to biological network modules. We determined the NCAE configuration with the strongest co-localization and centrality signals in the human protein interactome. Leveraging the NCAE embeddings, we then trained interpretable deep neural networks for risk factor (aging, smoking) and disease (systemic lupus erythematosus) prediction and classification tasks. Remarkably, our NCAE embedding-based models outperformed existing predictors, revealing novel DNA methylation signatures enriched in gene sets and pathways associated with the studied condition in each case. Our data-driven biomarker discovery workflow provides a generally applicable pipeline to capture relevant risk factor and disease information. By surpassing the limitations of knowledge-driven methods, our approach enhances the understanding of complex epigenetic processes, facilitating the development of more effective diagnostic and therapeutic strategies.
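The two-stage idea described above (learn a compact latent embedding of methylation data with an autoencoder, then train a predictor on that embedding) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and not the authors' NCAE implementation: the data are synthetic beta values, the network is a small single-hidden-layer autoencoder trained with plain gradient descent, the layer sizes and learning rate are arbitrary, and a linear least-squares readout stands in for the interpretable deep neural networks used in the study. The network-coherence scoring against the protein interactome is not shown.

```python
import numpy as np


def make_synthetic_betas(n_samples=200, n_cpgs=50, n_factors=3, seed=0):
    """Hypothetical data: beta values in (0, 1) driven by a few hidden factors."""
    rng = np.random.default_rng(seed)
    factors = rng.normal(size=(n_samples, n_factors))
    loadings = rng.normal(size=(n_factors, n_cpgs))
    logits = factors @ loadings + 0.1 * rng.normal(size=(n_samples, n_cpgs))
    betas = 1.0 / (1.0 + np.exp(-logits))  # squash to the (0, 1) range
    return betas, factors


def train_autoencoder(X, latent_dim=4, lr=0.5, epochs=400, seed=0):
    """Tanh encoder, linear decoder, full-batch gradient descent on the MSE."""
    rng = np.random.default_rng(seed)
    n_features = X.shape[1]
    W_enc = rng.normal(0.0, 0.1, size=(n_features, latent_dim))
    b_enc = np.zeros(latent_dim)
    W_dec = rng.normal(0.0, 0.1, size=(latent_dim, n_features))
    b_dec = np.zeros(n_features)
    losses = []
    for _ in range(epochs):
        # Forward pass: encode to the latent space, then reconstruct.
        Z = np.tanh(X @ W_enc + b_enc)
        err = Z @ W_dec + b_dec - X
        losses.append(float(np.mean(err ** 2)))
        # Backward pass for the mean squared reconstruction error.
        g_out = 2.0 * err / err.size
        g_pre = (g_out @ W_dec.T) * (1.0 - Z ** 2)  # derivative of tanh
        W_dec -= lr * (Z.T @ g_out)
        b_dec -= lr * g_out.sum(axis=0)
        W_enc -= lr * (X.T @ g_pre)
        b_enc -= lr * g_pre.sum(axis=0)

    def encode(X_new):
        return np.tanh(X_new @ W_enc + b_enc)

    return encode, losses


betas, factors = make_synthetic_betas()
X = betas - betas.mean(axis=0)  # center each CpG site

encode, losses = train_autoencoder(X)
Z = encode(X)  # latent embedding, one row per sample

# Hypothetical phenotype tied to the first hidden factor; a linear readout on
# the embedding stands in for the downstream predictor.
rng = np.random.default_rng(1)
phenotype = factors[:, 0] + 0.1 * rng.normal(size=factors.shape[0])
design = np.column_stack([Z, np.ones(len(Z))])  # embedding plus intercept
coef, *_ = np.linalg.lstsq(design, phenotype, rcond=None)

print("reconstruction MSE: first epoch", losses[0], "last epoch", losses[-1])
```

In practice the embedding would be learned once on a large compendium and then reused across prediction tasks, which is the design choice the workflow relies on; the sketch reuses the same samples for both stages only to stay short.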