Bishara Isaac, Chen Jinfeng, Griffiths Jason I, Bild Andrea H, Nath Aritro
Department of Medical Oncology and Therapeutics, City of Hope Comprehensive Cancer Center, Duarte, CA, United States.
Irell & Manella Graduate School of Biological Science, City of Hope Comprehensive Cancer Center, Duarte, CA, United States.
Front Genet. 2022 Nov 25;13:982019. doi: 10.3389/fgene.2022.982019. eCollection 2022.
Recent advances in single-cell RNA sequencing (scRNA-seq) technologies have been invaluable in the study of the diversity of cancer cells and the tumor microenvironment. While scRNA-seq platforms allow processing of a large number of cells, uneven read quality and technical artifacts hinder the ability to identify biologically relevant cells and classify them into the correct subtypes. This obstructs the analysis of cancer and normal cell diversity, and rare or low-expression cell populations may be lost when arbitrarily high cutoffs on unique molecular identifier (UMI) counts are used to filter out low-quality cells. To address these issues, we have developed a novel machine-learning framework that (1) trains a cell lineage and subtype classifier using a gold-standard dataset validated with marker genes, (2) systematically assesses the lowest UMI threshold that can be used in a given dataset to accurately classify cells, and (3) assigns accurate cell lineage and subtype labels to the lower-read-depth cells recovered by setting the optimal threshold. We demonstrate the application of this framework in a well-curated scRNA-seq dataset of breast cancer patients and in two external datasets. We show that the minimum UMI threshold for the breast cancer dataset could be lowered from the original 1500 to 450, thereby increasing the total number of recovered cells by 49% while achieving a classification accuracy of >0.9. Our framework provides a roadmap for future scRNA-seq studies to determine the optimal UMI threshold and accurately classify cells for downstream analyses.
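The three steps of the framework can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not the authors' implementation: the data are simulated, the six-gene panel and the `PROFILES` weights are hypothetical, a nearest-centroid rule stands in for whatever classifier the study trained, and the helper names (`downsample`, `accuracy_at`, `lowest_usable_threshold`) are invented here. The idea shown is only the general one: train on high-depth, confidently labeled cells, downsample held-out cells to candidate UMI depths, and keep the lowest depth at which accuracy still meets the target.

```python
import random
from collections import Counter

GENES = 6  # toy panel: genes 0-2 favor lineage "A", genes 3-5 favor lineage "B"
PROFILES = {"A": [3, 3, 3, 2, 2, 2], "B": [2, 2, 2, 3, 3, 3]}  # hypothetical weights

def simulate_cell(label, depth, rng):
    """Draw `depth` UMIs for one cell from its lineage's expression weights."""
    drawn = Counter(rng.choices(range(GENES), weights=PROFILES[label], k=depth))
    return [drawn.get(g, 0) for g in range(GENES)]

def downsample(counts, depth, rng):
    """Subsample a cell to `depth` total UMIs without replacement."""
    pool = [g for g, c in enumerate(counts) for _ in range(c)]
    if depth >= len(pool):
        return list(counts)
    kept = Counter(rng.sample(pool, depth))
    return [kept.get(g, 0) for g in range(len(counts))]

def normalize(counts):
    """Convert raw counts to per-cell fractions."""
    total = sum(counts) or 1
    return [c / total for c in counts]

def fit_centroids(cells, labels):
    """Step 1 (stand-in): mean normalized profile per label from gold-standard cells."""
    n = Counter(labels)
    sums = {lab: [0.0] * GENES for lab in n}
    for counts, lab in zip(cells, labels):
        for g, v in enumerate(normalize(counts)):
            sums[lab][g] += v
    return {lab: [v / n[lab] for v in vec] for lab, vec in sums.items()}

def predict(centroids, counts):
    """Assign the label whose centroid is closest in squared Euclidean distance."""
    prof = normalize(counts)
    return min(centroids,
               key=lambda lab: sum((p - c) ** 2 for p, c in zip(prof, centroids[lab])))

def accuracy_at(centroids, cells, labels, depth, rng):
    """Accuracy on held-out cells after downsampling each to `depth` UMIs."""
    hits = sum(predict(centroids, downsample(c, depth, rng)) == lab
               for c, lab in zip(cells, labels))
    return hits / len(cells)

def lowest_usable_threshold(accuracy_by_threshold, min_accuracy):
    """Step 2: walk from the highest threshold down; stop at the first miss."""
    chosen = None
    for depth in sorted(accuracy_by_threshold, reverse=True):
        if accuracy_by_threshold[depth] < min_accuracy:
            break
        chosen = depth
    return chosen

rng = random.Random(0)
labels = ["A", "B"] * 200
gold = [simulate_cell(lab, 200, rng) for lab in labels]  # simulated high-depth cells
train_cells, train_labels = gold[:200], labels[:200]
test_cells, test_labels = gold[200:], labels[200:]

centroids = fit_centroids(train_cells, train_labels)
accuracy_by_threshold = {
    d: accuracy_at(centroids, test_cells, test_labels, d, rng)
    for d in (200, 100, 50, 25, 10, 5)
}
chosen = lowest_usable_threshold(accuracy_by_threshold, min_accuracy=0.9)

# Step 3: label a newly recovered cell that only reaches the chosen depth.
recovered_label = predict(centroids, simulate_cell("B", chosen, rng))
print(accuracy_by_threshold, chosen, recovered_label)
```

Evaluating on held-out cells rather than on the training cells keeps the accuracy estimate at each depth from being inflated. The stopping rule is deliberately conservative: it never accepts a low threshold that passes by chance after a higher one has already failed.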