Motwani Keshav, Bacher Rhonda, Molstad Aaron J
Department of Biostatistics, University of Washington.
Department of Biostatistics, University of Florida.
Ann Appl Stat. 2023 Dec;17(4):3426-3449. doi: 10.1214/23-aoas1769.
Categorizing individual cells into one of many known cell type categories, also known as cell type annotation, is a critical step in the analysis of single-cell genomics data. The current process of annotation is time-intensive and subjective, which has led to different studies describing cell types with labels of varying degrees of resolution. While supervised learning approaches have provided automated solutions to annotation, there remains a significant challenge in fitting a unified model for multiple datasets with inconsistent labels. In this article, we propose a new multinomial logistic regression estimator which can be used to model cell type probabilities by integrating multiple datasets with labels of varying resolution. To compute our estimator, we solve a nonconvex optimization problem using a blockwise proximal gradient descent algorithm. We show through simulation studies that our approach estimates cell type probabilities more accurately than competitors in a wide variety of scenarios. We apply our method to ten single-cell RNA-seq datasets and demonstrate its utility in predicting fine resolution cell type labels on unlabeled data as well as refining cell type labels on data with existing coarse resolution annotations. Finally, we demonstrate that our method can lead to novel scientific insights in the context of a differential expression analysis comparing peripheral blood gene expression before and after treatment with interferon-β. An R package implementing the method is available at https://github.com/keshav-motwani/IBMR and the collection of datasets we analyze is available at https://github.com/keshav-motwani/AnnotatedPBMC.
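To make the idea of fitting one multinomial model to labels of mixed resolution concrete, the following Python sketch shows one natural formulation. It is an illustration under stated assumptions, not the IBMR package or its API. The assumptions are: a coarse label is treated as a set ("bin") of compatible fine cell types, and its probability is the sum of the fine-type probabilities in that bin; the penalty is a row-wise group lasso; and the solver is a plain fixed-step proximal gradient loop rather than the blockwise algorithm the abstract describes. All function names (softmax, binned_nll, binned_grad, prox_group_lasso, fit) are hypothetical.

```python
import numpy as np


def softmax(scores):
    """Row-wise softmax with the usual max-shift for numerical stability."""
    shifted = scores - scores.max(axis=1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=1, keepdims=True)


def binned_nll(B, X, bins):
    """Average negative log-likelihood under the assumed binned model.

    X    : (n, p) expression features
    B    : (p, K) coefficients, one column per fine cell type
    bins : (n, K) 0/1 mask; bins[i, k] = 1 if fine type k is compatible
           with the (possibly coarse) label of cell i. Every row is
           assumed to contain at least one 1.
    """
    P = softmax(X @ B)
    return -np.mean(np.log((P * bins).sum(axis=1)))


def binned_grad(B, X, bins):
    """Gradient of binned_nll with respect to B."""
    P = softmax(X @ B)
    Q = P * bins
    Q /= Q.sum(axis=1, keepdims=True)  # fine-type probabilities within each bin
    return X.T @ (P - Q) / X.shape[0]


def prox_group_lasso(B, t):
    """Row-wise soft-thresholding: shrinks each feature's row toward zero."""
    norms = np.linalg.norm(B, axis=1, keepdims=True)
    scale = np.maximum(0.0, 1.0 - t / np.maximum(norms, 1e-12))
    return B * scale


def fit(X, bins, lam=0.01, step=0.1, iters=200):
    """Fixed-step proximal gradient descent (assumed penalty and step size)."""
    B = np.zeros((X.shape[1], bins.shape[1]))
    for _ in range(iters):
        B = prox_group_lasso(B - step * binned_grad(B, X, bins), step * lam)
    return B
```

In this sketch, a cell annotated at fine resolution has a one-hot row in bins, a cell with a coarse annotation has a 1 for every fine type nested under that label, and an unlabeled cell could be given a row of all ones so that it contributes nothing to the likelihood. Because the log of a sum of softmax probabilities is not concave in B, the objective is nonconvex, which is consistent with the abstract's description of the optimization problem.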