Phenotypic subtyping via contrastive learning.

Authors

Gorla Aditya, Sankararaman Sriram, Burchard Esteban, Flint Jonathan, Zaitlen Noah, Rahmani Elior

Affiliations

Bioinformatics Interdepartmental Program, University of California, Los Angeles, Los Angeles, CA, USA.

Department of Computer Science, University of California, Los Angeles, Los Angeles, CA, USA.

Publication

bioRxiv. 2023 Jan 6:2023.01.05.522921. doi: 10.1101/2023.01.05.522921.

DOI:10.1101/2023.01.05.522921

PMID:36711575

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC9881932/

Abstract

Defining and accounting for subphenotypic structure has the potential to increase statistical power and provide a deeper understanding of the heterogeneity in the molecular basis of complex disease. Existing phenotype subtyping methods primarily rely on clinically observed heterogeneity or metadata clustering. However, they tend to capture the dominant sources of variation in the data, which often do not describe the mechanistic heterogeneity of the phenotype of interest; in fact, dominant sources of variation such as population structure or technical variation are generally expected to be independent of subphenotypic structure. We instead aim to find a subspace with signal that is unique to a group of samples for which we believe that subphenotypic variation exists (e.g., cases of a disease). To that end, we introduce Phenotype Aware Components Analysis (PACA), a contrastive learning approach leveraging canonical correlation analysis to robustly capture weak sources of subphenotypic variation. In the context of disease, PACA learns a gradient of variation unique to cases in a given dataset, while leveraging control samples to account for variation and imbalances of biological and technical confounders between cases and controls. We evaluated PACA using an extensive simulation study, as well as on various subtyping tasks using genotypes, transcriptomics, and DNA methylation data. Our results provide multiple lines of strong evidence that PACA robustly captures weak unknown variation of interest while being calibrated and well-powered, far surpassing the performance of alternative methods. This makes PACA a state-of-the-art tool for defining subtypes that are more likely to reflect molecular heterogeneity, especially in challenging cases where the phenotypic heterogeneity may be masked by a myriad of strong unrelated effects in the data.
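The abstract does not spell out the algorithm, so the following is only a minimal sketch of one plausible reading of the idea, not the authors' implementation. It assumes that canonical correlation analysis is run between the case matrix and the control matrix with features treated as observations, that the top `k` shared canonical variates are projected out of the cases, and that the first principal component of what remains serves as the case-specific gradient. The function names, the value of `k`, and the simulated data are all hypothetical and chosen for illustration.

```python
import numpy as np


def simulate(m=1000, n_cases=100, n_controls=100, k=3, seed=0):
    """Toy data (an assumption): strong shared confounders plus one weak case-only factor."""
    rng = np.random.default_rng(seed)
    loadings = rng.normal(size=(m, k))                  # confounder effects on features
    z_cases = 5.0 * rng.normal(size=(k, n_cases))       # strong confounders in cases
    z_controls = 5.0 * rng.normal(size=(k, n_controls)) # same confounders in controls
    u = rng.normal(size=m)                              # feature effects of the hidden factor
    s = rng.normal(size=n_cases)                        # hidden per-case gradient
    X = loadings @ z_cases + 0.5 * np.outer(u, s) + rng.normal(size=(m, n_cases))
    Y = loadings @ z_controls + rng.normal(size=(m, n_controls))
    return X, Y, s


def top_pc_scores(M):
    """Per-sample scores on the first principal component of a features x samples matrix."""
    centered = M - M.mean(axis=1, keepdims=True)        # center each feature across samples
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[0]


def contrastive_scores(X, Y, k):
    """Remove variation shared with controls (via CCA), then take the top PC of the cases."""
    # Center each sample across features, since features act as CCA observations here.
    Xc = X - X.mean(axis=0, keepdims=True)
    Yc = Y - Y.mean(axis=0, keepdims=True)
    # CCA via QR + SVD: singular values of Qx^T Qy are the canonical correlations.
    qx, _ = np.linalg.qr(Xc)
    qy, _ = np.linalg.qr(Yc)
    u_mat, _, _ = np.linalg.svd(qx.T @ qy, full_matrices=False)
    shared = qx @ u_mat[:, :k]                          # orthonormal case-side canonical variates
    residual = Xc - shared @ (shared.T @ Xc)            # project shared directions out of cases
    return top_pc_scores(residual)


def abs_corr(a, b):
    """Absolute Pearson correlation between two score vectors."""
    return abs(np.corrcoef(a, b)[0, 1])


X, Y, s = simulate()
print("plain PCA on cases:", abs_corr(top_pc_scores(X), s))
print("contrastive sketch:", abs_corr(contrastive_scores(X, Y, k=3), s))
```

In this toy setting the confounders dominate the variance, so plain PCA on the cases mostly tracks them, while the contrastive step uses the controls to identify and remove that shared structure first. How `k` should be chosen, and how calibration is assessed, are not described in the abstract and are left open here.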
