Using LLMs and Explainable ML to Analyze Biomarkers at Single-Cell Level for Improved Understanding of Diseases.

Affiliations

Department of Energy Conversion and Storage, Technical University of Denmark, 2800 Kongens Lyngby, Denmark.

Abzu ApS, 2150 København, Denmark.

Publication information

Biomolecules. 2023 Oct 12;13(10):1516. doi: 10.3390/biom13101516.

DOI:10.3390/biom13101516

PMID:37892198

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC10605495/

Abstract

Single-cell RNA sequencing (scRNA-seq) technology has significantly advanced our understanding of the diversity of cells and how this diversity is implicated in diseases. Yet, translating these findings across various scRNA-seq datasets poses challenges due to technical variability and dataset-specific biases. To overcome this, we present a novel approach that employs both an LLM-based framework and explainable machine learning to facilitate generalization across single-cell datasets and identify gene signatures to capture disease-driven transcriptional changes. Our approach uses scBERT, which harnesses shared transcriptomic features among cell types to establish consistent cell-type annotations across multiple scRNA-seq datasets. Additionally, we employed a symbolic regression algorithm to pinpoint highly relevant, yet minimally redundant models and features for inferring a cell type's disease state based on its transcriptomic profile. We ascertained the versatility of these cell-specific gene signatures across datasets, showcasing their resilience as molecular markers to pinpoint and characterize disease-associated cell types. The validation was carried out using four publicly available scRNA-seq datasets from both healthy individuals and those suffering from ulcerative colitis (UC). This demonstrates our approach's efficacy in bridging disparities specific to different datasets, fostering comparative analyses. Notably, the simplicity and symbolic nature of the retrieved gene signatures facilitate their interpretability, allowing us to elucidate underlying molecular disease mechanisms using these models.
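
The idea of searching for a compact, symbolic gene signature that separates diseased from healthy cells can be illustrated with a toy sketch. This is not the authors' symbolic regression algorithm: it is a brute-force enumeration over a deliberately tiny expression space, run on synthetic data. The gene names `G1` to `G4`, the three expression forms, the planted rule `G1 > G2`, and the use of AUC as the selection criterion are all assumptions made for illustration.

```python
import itertools
import random

# Hypothetical search space: three two-gene expression forms.
FORMS = {
    "{a} + {b}": lambda x, y: x + y,
    "{a} - {b}": lambda x, y: x - y,
    "{a} * {b}": lambda x, y: x * y,
}


def auc(scores, labels):
    """Fraction of (diseased, healthy) cell pairs ranked correctly; ties count half."""
    pos = [s for s, lab in zip(scores, labels) if lab == 1]
    neg = [s for s, lab in zip(scores, labels) if lab == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def search_signatures(cells, labels, genes):
    """Score every form on every gene pair and return the best (expression, AUC)."""
    best_expr, best_quality = None, 0.0
    for a, b in itertools.combinations(genes, 2):
        for template, fn in FORMS.items():
            scores = [fn(cell[a], cell[b]) for cell in cells]
            quality = auc(scores, labels)
            # An expression that ranks cells in reverse is equally informative.
            quality = max(quality, 1.0 - quality)
            if quality > best_quality:
                best_expr, best_quality = template.format(a=a, b=b), quality
    return best_expr, best_quality


# Synthetic "expression profiles": 200 cells, 4 made-up genes, values in [0, 1).
rng = random.Random(0)
genes = ["G1", "G2", "G3", "G4"]
cells = [{g: rng.random() for g in genes} for _ in range(200)]
# Planted ground truth (an assumption of this sketch): diseased when G1 exceeds G2.
labels = [1 if cell["G1"] > cell["G2"] else 0 for cell in cells]

best_expr, best_quality = search_signatures(cells, labels, genes)
print(best_expr, best_quality)
```

Because the winning model is a short explicit expression rather than a weight matrix, it can be read directly as a hypothesis about which genes interact, which is the interpretability benefit the abstract highlights. A real symbolic regression engine would search a far larger space of operators and feature combinations and would guard against overfitting, for example by validating on held-out datasets.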
