scOTM: A Deep Learning Framework for Predicting Single-Cell Perturbation Responses with Large Language Models.

Authors

Wang Yuchen, Lu Tianchi, Chen Xingjian, Yao Zhongyu, Wong Ka-Chun

Affiliations

Department of Computer Science, City University of Hong Kong, Kowloon Tong, Hong Kong SAR 999077, China.

Cutaneous Biology Research Center, Massachusetts General Hospital, Harvard Medical School, Boston, MA 02148, USA.

Publication information

Bioengineering (Basel). 2025 Aug 20;12(8):884. doi: 10.3390/bioengineering12080884.

DOI:10.3390/bioengineering12080884

PMID:40868397

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12383350/

Abstract

Modeling drug-induced transcriptional responses at the single-cell level is essential for advancing human healthcare, particularly in understanding disease mechanisms, assessing therapeutic efficacy, and anticipating adverse effects. However, existing approaches often impose a rigid constraint by enforcing pointwise alignment of latent representations to a standard normal prior, which limits expressiveness and results in biologically uninformative embeddings, especially in complex biological systems. Additionally, many methods inadequately address the challenges of unpaired data, typically relying on naive averaging strategies that ignore cell-type specificity and intercellular heterogeneity. To overcome these limitations, we propose scOTM, a deep learning framework designed to predict single-cell perturbation responses from unpaired data, focusing on generalization to unseen cell types. scOTM integrates prior biological knowledge of perturbations and cellular states, derived from large language models specialized for molecular and single-cell corpora. These informative representations are incorporated into a variational autoencoder with maximum mean discrepancy regularization, allowing flexible modeling of transcriptional shifts without imposing a strict constraint of alignment to a standard normal prior. scOTM further employs optimal transport to establish an efficient and interpretable mapping between control and perturbed distributions, effectively capturing the transcriptional shifts underlying response variation. Extensive experiments demonstrate that scOTM outperforms existing methods in predicting whole-transcriptome responses and identifying top differentially expressed genes. Furthermore, scOTM exhibits superior robustness in data-limited settings and strong generalization capabilities across cell types.
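The two generic ingredients named in the abstract, maximum mean discrepancy (MMD) and optimal transport between control and perturbed cells, can be illustrated with a minimal NumPy sketch. This is not the scOTM implementation: the Gaussian kernel, the fixed bandwidth, the Sinkhorn solver, the barycentric projection, and all function names (`rbf_kernel`, `mmd2`, `sinkhorn_plan`, `barycentric_map`) are assumptions chosen for illustration, since the abstract does not specify these details.

```python
import numpy as np


def sq_dists(a, b):
    """Pairwise squared Euclidean distances between rows of a and rows of b."""
    d = (
        np.sum(a**2, axis=1)[:, None]
        + np.sum(b**2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    return np.maximum(d, 0.0)  # clip tiny negatives caused by round-off


def rbf_kernel(a, b, bandwidth):
    """Gaussian (RBF) kernel matrix; the kernel choice is an assumption."""
    return np.exp(-sq_dists(a, b) / (2.0 * bandwidth**2))


def mmd2(x, y, bandwidth=1.0):
    """Biased (V-statistic) estimate of the squared MMD between two samples."""
    k_xx = rbf_kernel(x, x, bandwidth)
    k_yy = rbf_kernel(y, y, bandwidth)
    k_xy = rbf_kernel(x, y, bandwidth)
    return k_xx.mean() + k_yy.mean() - 2.0 * k_xy.mean()


def sinkhorn_plan(x, y, epsilon=0.1, n_iters=200):
    """Entropy-regularized OT plan between uniform empirical measures.

    The cost is the squared Euclidean distance, rescaled to [0, 1] so that
    epsilon has a comparable meaning across datasets (an illustrative choice).
    """
    cost = sq_dists(x, y)
    cost = cost / cost.max()
    gibbs = np.exp(-cost / epsilon)
    a = np.full(x.shape[0], 1.0 / x.shape[0])  # uniform mass on control cells
    b = np.full(y.shape[0], 1.0 / y.shape[0])  # uniform mass on perturbed cells
    u = np.ones_like(a)
    v = np.ones_like(b)
    for _ in range(n_iters):
        u = a / (gibbs @ v)
        v = b / (gibbs.T @ u)
    return u[:, None] * gibbs * v[None, :]


def barycentric_map(plan, y):
    """Send each control cell to the plan-weighted mean of perturbed cells."""
    return (plan @ y) / plan.sum(axis=1, keepdims=True)


# Toy latent codes standing in for unpaired control and perturbed cells.
rng = np.random.default_rng(0)
control = rng.normal(0.0, 1.0, size=(50, 4))
perturbed = rng.normal(1.0, 1.0, size=(60, 4))

shift_before = mmd2(control, perturbed)
plan = sinkhorn_plan(control, perturbed)
predicted = barycentric_map(plan, perturbed)
shift_after = mmd2(predicted, perturbed)
```

In this toy setting the transport plan couples unpaired control and perturbed samples without requiring one-to-one matching, which is the role the abstract assigns to optimal transport; comparing `shift_before` with `shift_after` gives a rough check of how much closer the mapped cells are to the perturbed distribution.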
