Self-omics: A Self-supervised Learning Framework for Multi-omics Cancer Data.

Author information

Mohamed Bin Zayed University of Artificial Intelligence, Abu Dhabi, UAE.

Publication information

Pac Symp Biocomput. 2023;28:263-274.

Abstract

We have gained access to vast amounts of multi-omics data thanks to Next Generation Sequencing. However, it is challenging to analyse this data because it is high-dimensional and much of it is not annotated. Lack of annotated data is a significant problem in machine learning, and Self-Supervised Learning (SSL) methods are typically used to deal with limited labelled data. However, there is a lack of studies that use SSL methods to exploit inter-omics relationships on unlabelled multi-omics data. In this work, we develop a novel and efficient pre-training paradigm that consists of various SSL components, including but not limited to contrastive alignment, data recovery from corrupted samples, and using one type of omics data to recover other omics types. Our pre-training paradigm improves performance on downstream tasks with limited labelled data. We show that our approach outperforms the state-of-the-art method in cancer type classification on the TCGA pancancer dataset in a semi-supervised setting. Moreover, we show that the encoders that are pre-trained using our approach can be used as powerful feature extractors even without fine-tuning. Our ablation study shows that the method is not overly dependent on any single pretext task component. The network architectures in our approach are designed to handle missing omics types and multiple datasets for pre-training and downstream training. Our pre-training paradigm can be extended to perform zero-shot classification of rare cancers.
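The three pretext tasks named in the abstract (contrastive alignment, recovery from corrupted samples, and recovering one omics type from another) can be illustrated with a small NumPy sketch. This is not the authors' implementation: the linear encoders and decoders, the masking corruption, the InfoNCE-style loss, the equal loss weighting, and every name and dimension below are assumptions chosen only to show how such losses might be combined.

```python
import numpy as np

def corrupt(x, rate, rng):
    """Zero out a random fraction of features (an assumed masking corruption)."""
    mask = rng.random(x.shape) < rate
    return np.where(mask, 0.0, x), mask

def mse(a, b):
    """Mean squared reconstruction error."""
    return float(np.mean((a - b) ** 2))

def l2_normalize(z, eps=1e-12):
    return z / (np.linalg.norm(z, axis=1, keepdims=True) + eps)

def contrastive_alignment_loss(z_a, z_b, temperature=0.1):
    """InfoNCE-style loss: row i of z_a should match row i of z_b."""
    z_a, z_b = l2_normalize(z_a), l2_normalize(z_b)
    logits = z_a @ z_b.T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))

rng = np.random.default_rng(0)
n, d_a, d_b, d_z = 8, 20, 30, 5      # hypothetical sample and feature sizes
x_a = rng.normal(size=(n, d_a))      # stand-in for one omics type
x_b = rng.normal(size=(n, d_b))      # stand-in for a second omics type

# Untrained linear maps used as placeholder encoders and decoders.
W_enc_a = rng.normal(size=(d_a, d_z)) * 0.1
W_enc_b = rng.normal(size=(d_b, d_z)) * 0.1
W_dec_a = rng.normal(size=(d_z, d_a)) * 0.1   # latent A -> omics A
W_dec_ab = rng.normal(size=(d_z, d_b)) * 0.1  # latent A -> omics B

x_a_corrupt, _ = corrupt(x_a, 0.3, rng)
z_a = x_a_corrupt @ W_enc_a
z_b = x_b @ W_enc_b

loss_recover = mse(z_a @ W_dec_a, x_a)             # recover clean A from corrupted A
loss_cross = mse(z_a @ W_dec_ab, x_b)              # recover B from A's embedding
loss_align = contrastive_alignment_loss(z_a, z_b)  # align paired embeddings
total = loss_recover + loss_cross + loss_align     # equal weights are an assumption
print(total)
```

In an actual pre-training setup these losses would be minimised with respect to the encoder and decoder parameters; here the weights are random, so the printed value only demonstrates that the pieces fit together.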
