Department of Computer Science, Stanford University, United States.
Pac Symp Biocomput. 2022;27:337-348.
Single-cell RNA sequencing (scRNA-seq) has the potential to provide powerful, high-resolution signatures to inform disease prognosis and precision medicine. This paper takes an important first step towards this goal by developing an interpretable machine learning algorithm, CloudPred, to predict individuals' disease phenotypes from their scRNA-seq data. Predicting phenotype from scRNA-seq is challenging for standard machine learning methods: the number of cells measured can vary by orders of magnitude across individuals, and the cell populations are highly heterogeneous. Typical analyses create pseudo-bulk samples, which are biased toward prior annotations and lose single-cell resolution. CloudPred addresses these challenges via a novel end-to-end differentiable learning algorithm coupled with a biologically informed mixture-of-cell-types model. CloudPred automatically infers the cell subpopulations that are salient for the phenotype without prior annotations. We developed a systematic simulation platform to evaluate the performance of CloudPred and several alternative methods we propose, and find that CloudPred outperforms the alternative methods across several settings. We further validated CloudPred on a real scRNA-seq dataset of 142 lupus patients and controls. CloudPred achieves an AUROC of 0.98 while identifying a specific subpopulation of CD4 T cells whose presence is highly indicative of lupus. CloudPred is a powerful new framework to predict clinical phenotypes from scRNA-seq data and to identify relevant cells.
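To make the central idea concrete, the following is a minimal sketch, not the CloudPred implementation. It shows how a mixture model over cells can turn a variable-size set of cells into a fixed-length "cell-type composition" vector, which a simple logistic classifier can then score. All function names, mixture parameters, and classifier weights are hypothetical and set by hand for illustration. The abstract describes the mixture and the predictor as being learned jointly end to end, which this sketch does not attempt.

```python
import math

def gaussian_logpdf(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at point x."""
    return sum(
        -0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )

def responsibilities(x, weights, means, variances):
    """Posterior probability of each mixture component for one cell."""
    logs = [
        math.log(w) + gaussian_logpdf(x, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    top = max(logs)  # subtract the max for numerical stability
    exps = [math.exp(l - top) for l in logs]
    total = sum(exps)
    return [e / total for e in exps]

def composition(cells, weights, means, variances):
    """Average the per-cell responsibilities over one individual's cells.

    The result has one entry per component, however many cells there are.
    """
    sums = [0.0] * len(weights)
    for x in cells:
        for j, r in enumerate(responsibilities(x, weights, means, variances)):
            sums[j] += r
    return [s / len(cells) for s in sums]

def predict_proba(cells, weights, means, variances, coef, intercept):
    """Logistic score computed on the composition vector."""
    feats = composition(cells, weights, means, variances)
    z = intercept + sum(c * f for c, f in zip(coef, feats))
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical two-component mixture in a 2-D embedding (assumed values).
WEIGHTS = [0.5, 0.5]
MEANS = [(0.0, 0.0), (5.0, 5.0)]
VARIANCES = [(1.0, 1.0), (1.0, 1.0)]
COEF, INTERCEPT = [0.0, 4.0], -2.0  # component 2 is treated as phenotype-associated

# Two individuals with different numbers of cells.
individual_a = [(0.1, -0.2), (0.3, 0.1), (-0.4, 0.2)]
individual_b = [(5.1, 4.8), (4.9, 5.3), (5.2, 5.0), (0.0, 0.1), (4.7, 5.1)]

p_a = predict_proba(individual_a, WEIGHTS, MEANS, VARIANCES, COEF, INTERCEPT)
p_b = predict_proba(individual_b, WEIGHTS, MEANS, VARIANCES, COEF, INTERCEPT)
```

Because the feature vector is an average over cells, individuals with three cells or three thousand are scored on the same footing. The classifier weight on each component also indicates which subpopulation drives the prediction, which is the kind of interpretability the abstract refers to.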