Zhao Chen, Liu Anqi, Zhang Xiao, Cao Xuewei, Ding Zhengming, Sha Qiuying, Shen Hui, Deng Hong-Wen, Zhou Weihua
Department of Applied Computing, Michigan Technological University, 1400 Townsend Dr, Houghton, MI 49931, USA.
Division of Biomedical Informatics and Genomics, Tulane Center of Biomedical Informatics and Genomics, Deming Department of Medicine, Tulane University, New Orleans, LA 70112, USA.
ArXiv. 2023 Apr 12:arXiv:2304.05542v1.
Integrating heterogeneous, high-dimensional multi-omics data is becoming increasingly important for understanding genetic data. Each omics technique provides only a limited view of the underlying biological process, so integrating heterogeneous omics layers simultaneously would lead to a more comprehensive and detailed understanding of diseases and phenotypes. However, one obstacle to multi-omics data integration is unpaired multi-omics data, which arise from limits in instrument sensitivity and from cost. Studies may fail if certain aspects of the subjects are missing or incomplete. In this paper, we propose a deep learning method for multi-omics integration with incomplete data: Cross-omics Linked unified embedding with Contrastive Learning and Self Attention (CLCLSA). Using complete multi-omics data as supervision, the model employs cross-omics autoencoders to learn feature representations across different types of biological data. Multi-omics contrastive learning, which maximizes the mutual information between different types of omics, is applied before latent feature concatenation. In addition, feature-level and omics-level self-attention are employed to dynamically identify the most informative features for multi-omics data integration. Extensive experiments were conducted on four public multi-omics datasets. The results indicate that CLCLSA outperformed state-of-the-art approaches for multi-omics data classification using incomplete multi-omics data.
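As a rough illustration of the contrastive step described in the abstract, the sketch below computes a symmetric InfoNCE loss between the latent embeddings of two omics types for the same subjects. InfoNCE is a commonly used lower bound on mutual information; the abstract does not state which contrastive objective CLCLSA uses, so this choice is an assumption. The function names (`l2_normalize`, `info_nce`, `symmetric_info_nce`), the temperature value, and the random stand-in embeddings are all hypothetical and are not taken from the paper.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, eps)


def info_nce(z_a, z_b, temperature=0.1):
    """InfoNCE loss: row i of z_a should match row i of z_b (same subject)
    and mismatch every other row of z_b (different subjects)."""
    z_a = l2_normalize(z_a)
    z_b = l2_normalize(z_b)
    logits = (z_a @ z_b.T) / temperature            # shape (n_subjects, n_subjects)
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Positive pairs sit on the diagonal of the similarity matrix.
    return float(-np.mean(np.diag(log_prob)))


def symmetric_info_nce(z_a, z_b, temperature=0.1):
    """Average the loss in both directions (a -> b and b -> a)."""
    return 0.5 * (info_nce(z_a, z_b, temperature) + info_nce(z_b, z_a, temperature))


# Hypothetical stand-ins for latent features of two omics types, 8 subjects each.
rng = np.random.default_rng(0)
z_omics_1 = rng.normal(size=(8, 16))
z_omics_2 = z_omics_1 + 0.01 * rng.normal(size=(8, 16))  # nearly aligned embeddings

loss_aligned = symmetric_info_nce(z_omics_1, z_omics_2)
loss_shuffled = symmetric_info_nce(z_omics_1, z_omics_2[::-1])  # subjects mismatched
print(loss_aligned, loss_shuffled)
```

Minimizing this loss pulls together the embeddings of the same subject across omics types and pushes apart those of different subjects, which is one way the latent spaces could be aligned before concatenation.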