Department of Energy Conversion and Storage, Technical University of Denmark, 2800 Kongens Lyngby, Denmark.
Abzu ApS, 2150 København, Denmark.
Biomolecules. 2023 Oct 12;13(10):1516. doi: 10.3390/biom13101516.
Single-cell RNA sequencing (scRNA-seq) has significantly advanced our understanding of cellular diversity and of how that diversity is implicated in disease. Yet translating findings across scRNA-seq datasets remains difficult because of technical variability and dataset-specific biases. To address this, we present an approach that combines an LLM-based framework with explainable machine learning to generalize across single-cell datasets and to identify gene signatures that capture disease-driven transcriptional changes. The approach uses scBERT, which harnesses transcriptomic features shared among cell types, to establish consistent cell-type annotations across multiple scRNA-seq datasets. We then apply a symbolic regression algorithm to identify highly relevant yet minimally redundant models and features for inferring a cell type's disease state from its transcriptomic profile. We assessed how well these cell-specific gene signatures transfer across datasets, showing that they are robust molecular markers for identifying and characterizing disease-associated cell types. Validation was carried out on four publicly available scRNA-seq datasets from healthy individuals and patients with ulcerative colitis (UC). The results demonstrate that the approach bridges dataset-specific disparities and supports comparative analyses. Notably, the simplicity and symbolic form of the retrieved gene signatures make them interpretable, allowing us to use these models to elucidate the underlying molecular mechanisms of disease.
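The "highly relevant, yet minimally redundant" criterion mentioned above can be illustrated with a small sketch. This is not the symbolic regression algorithm used in the study; it is a simpler greedy stand-in that scores each gene by its absolute Pearson correlation with the disease label, minus its mean absolute correlation with genes already chosen. The gene names, the synthetic expression values, and the function `select_signature` are all hypothetical and exist only for illustration.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def select_signature(expr, labels, k):
    """Greedily pick up to k genes with high relevance and low redundancy.

    relevance  = |corr(gene, label)|
    redundancy = mean |corr(gene, already selected gene)|
    """
    relevance = {gene: abs(pearson(values, labels)) for gene, values in expr.items()}
    selected = []
    while len(selected) < min(k, len(expr)):
        best_gene, best_score = None, -math.inf
        for gene, values in expr.items():
            if gene in selected:
                continue
            if selected:
                redundancy = sum(
                    abs(pearson(values, expr[s])) for s in selected
                ) / len(selected)
            else:
                redundancy = 0.0
            score = relevance[gene] - redundancy
            if score > best_score:
                best_gene, best_score = gene, score
        selected.append(best_gene)
    return selected


# Synthetic data (assumption): 400 cells, label 1 = diseased, 0 = healthy.
rng = random.Random(0)
n_cells = 400
labels = [i % 2 for i in range(n_cells)]
gene_a = [y + rng.gauss(0, 0.5) for y in labels]
expr = {
    "gene_a": gene_a,                                         # informative
    "gene_a_copy": [v + rng.gauss(0, 0.05) for v in gene_a],  # near-duplicate of gene_a
    "gene_b": [y + rng.gauss(0, 0.6) for y in labels],        # informative, independent noise
    "gene_noise": [rng.gauss(0, 1.0) for _ in labels],        # uninformative
}

signature = select_signature(expr, labels, k=2)
print(signature)
```

On this toy data the redundancy penalty should prevent the two near-duplicate genes from being selected together, so the two-gene signature pairs one of them with `gene_b`. A symbolic regression method would go further and search over explicit functional forms of the selected genes rather than only ranking them.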