Department of Computer Science, University of Virginia, Charlottesville, VA, United States.
Department of Biochemistry and Molecular Genetics, University of Virginia, Charlottesville, VA, United States.
Bioinformatics. 2023 Aug 1;39(8). doi: 10.1093/bioinformatics/btad493.
The rapid advance of single-cell RNA sequencing (scRNA-seq) technology over the past decade has provided a rich resource of gene expression profiles of single cells measured in patients, facilitating the study of many biological questions at the single-cell level. One intriguing research direction is to study the single cells that play critical roles in patient phenotypes, which has the potential to identify the cells and genes driving disease phenotypes. To this end, deep learning models are expected to encode single-cell information well and achieve precise prediction of patients' phenotypes from scRNA-seq data. However, designing deep learning models for classifying patient samples faces critical challenges because (i) the samples collected in the same dataset contain a variable number of cells (some samples might have only hundreds of cells sequenced, while others could have thousands), and (ii) the number of samples available is typically small, and the expression profile of each cell is noisy and extremely high-dimensional. Moreover, the black-box nature of existing deep learning models makes it difficult for researchers to interpret the models and extract useful knowledge from them.
We propose a prototype-based and cell-informed model for patient phenotype classification, termed ProtoCell4P, that alleviates the problems of sample scarcity and varying cell counts by leveraging cell knowledge through representatives of cells (called prototypes), and that precisely classifies patients by adaptively incorporating information from different cells. Moreover, the classification process can be explicitly interpreted by identifying the key cells for decision making and by further summarizing the knowledge of cell types to unravel the biological nature of the classification. Our approach is explainable at single-cell resolution: it can identify the key cells in each patient's classification. The experimental results demonstrate that the proposed method can effectively handle patient classification using single-cell data and outperforms existing approaches. Furthermore, our approach is able to uncover associations between cell types and biological classes of interest from a data-driven perspective.
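To make the prototype idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes cells are already embedded as short numeric vectors, uses fixed (untrained) prototypes and classifier weights, and swaps plain mean pooling in for the adaptive weighting of cells described above. All function names (`cell_prototype_similarity`, `sample_representation`, `classify`, `key_cells`) are hypothetical. It illustrates two points only: a sample with any number of cells is reduced to one fixed-length vector of prototype similarities, and each cell's contribution to a class score can be read off directly.

```python
import math
import random


def cell_prototype_similarity(cell, prototype):
    """Similarity in (0, 1]: exp of the negative squared Euclidean distance."""
    sq_dist = sum((c - p) ** 2 for c, p in zip(cell, prototype))
    return math.exp(-sq_dist)


def sample_representation(cells, prototypes):
    """Average each prototype's similarity over all cells of one sample.

    The result has one entry per prototype, however many cells there are.
    (Assumption: mean pooling stands in for adaptive cell weighting.)
    """
    n_cells = len(cells)
    return [
        sum(cell_prototype_similarity(cell, proto) for cell in cells) / n_cells
        for proto in prototypes
    ]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify(cells, prototypes, weights):
    """Class probabilities from a linear layer on the pooled similarities.

    weights[k][j] is the (hypothetical) weight of prototype j for class k.
    """
    rep = sample_representation(cells, prototypes)
    logits = [sum(w * r for w, r in zip(row, rep)) for row in weights]
    return softmax(logits)


def key_cells(cells, prototypes, weights, target_class, top_k=3):
    """Indices of the cells contributing most to the target class score."""
    scored = []
    for idx, cell in enumerate(cells):
        contribution = sum(
            w * cell_prototype_similarity(cell, proto)
            for w, proto in zip(weights[target_class], prototypes)
        )
        scored.append((contribution, idx))
    scored.sort(reverse=True)
    return [idx for _, idx in scored[:top_k]]


# Toy data: two prototypes in a 2-D embedding space, two classes.
prototypes = [[0.0, 0.0], [1.0, 1.0]]
weights = [[1.0, 0.0], [0.0, 1.0]]

# Two samples with very different cell counts.
rng = random.Random(0)
small_sample = [[rng.gauss(0.0, 0.3), rng.gauss(0.0, 0.3)] for _ in range(5)]
large_sample = [[rng.gauss(1.0, 0.3), rng.gauss(1.0, 0.3)] for _ in range(500)]

for name, sample in (("small", small_sample), ("large", large_sample)):
    probs = classify(sample, prototypes, weights)
    top = key_cells(sample, prototypes, weights, probs.index(max(probs)))
    print(name, len(sample), "cells -> probabilities", probs, "key cells", top)
```

In a trained model the prototypes, the cell embedding, and the classifier weights would all be learned from data; here they are fixed by hand so that the pooling and the per-cell attribution are easy to follow.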