Gonzalez-Ferrer Jesus, Lehrer Julian, O'Farrell Ash, Paten Benedict, Teodorescu Mircea, Haussler David, Jonsson Vanessa D, Mostajo-Radji Mohammed A
These authors contributed equally to this work.
Genomics Institute, University of California Santa Cruz, Santa Cruz, 95060, CA, USA.
bioRxiv. 2023 Nov 17:2023.02.28.529615. doi: 10.1101/2023.02.28.529615.
Large single-cell RNA datasets have contributed to unprecedented biological insight. Often, these take the form of cell atlases and serve as a reference for automating cell labeling of newly sequenced samples. Yet, classification algorithms have lacked the capacity to accurately annotate cells, particularly in complex datasets. Here we present SIMS (Scalable, Interpretable Machine Learning for Single-Cell), an end-to-end, data-efficient machine learning pipeline for discrete classification of single-cell data that can be applied to new datasets with minimal coding. We benchmarked SIMS against common single-cell label transfer tools and demonstrated that it performs as well as or better than state-of-the-art algorithms. We then used SIMS to classify cells in one of the most complex tissues: the brain. We show that SIMS classifies cells of the adult cerebral cortex and hippocampus with high accuracy, and that this accuracy is maintained in trans-sample label transfers of the adult human cerebral cortex. We then applied SIMS to classify cells in the developing brain and demonstrated a high level of accuracy in predicting neuronal subtypes, even in periods of fate refinement, shedding light on genetic changes affecting specific cell types across development. Finally, we applied SIMS to single-cell datasets of cortical organoids to predict cell identities and unveil genetic variations between cell lines. SIMS identifies cell-line differences and misannotated cell lineages in human cortical organoids derived from different pluripotent stem cell lines. When cell types are obscured by stress signals, label transfer from primary tissue improves the accuracy of cortical organoid annotations, serving as a reliable ground truth. Altogether, we show that SIMS is a versatile and robust tool for cell-type classification from single-cell datasets.