Guo Fangfang, Gan Dailin, Li Jun
Department of Applied and Computational Mathematics and Statistics, University of Notre Dame, Notre Dame, IN 46556, USA.
Comput Struct Biotechnol J. 2024 Nov 4;23:3929-3937. doi: 10.1016/j.csbj.2024.10.044. eCollection 2024 Dec.
The application of large language models (LLMs) to single-cell gene-expression data has introduced a new type of data that includes a gene-embedding matrix in addition to the experimentally obtained gene-expression matrix. This paper addresses a fundamental problem in analyzing such data: how to combine the information from both matrices effectively to better define cell-to-cell distance. We identify a computationally feasible solution that clusters cells of the same type better than the alternatives on all six real datasets we tested, underscoring its advantage as a measure of cell-to-cell distance.
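To make the setting concrete, the sketch below shows the two inputs the abstract refers to and one simple way of combining them into a cell-to-cell distance. The abstract does not specify the solution the authors identify, so this is not their method: the expression-weighted averaging of gene embeddings, the Euclidean metric, the function names `cell_embeddings` and `pairwise_euclidean`, and the random toy data are all assumptions chosen only for illustration.

```python
import numpy as np


def cell_embeddings(expr, gene_emb):
    """Map each cell to an expression-weighted average of gene embeddings.

    expr:     (n_cells, n_genes) nonnegative gene-expression matrix.
    gene_emb: (n_genes, d) gene-embedding matrix, e.g. produced by an LLM.
    This combination rule is an illustrative assumption, not the paper's method.
    """
    totals = expr.sum(axis=1, keepdims=True)
    # Guard against cells with zero total expression.
    weights = expr / np.maximum(totals, 1e-12)
    return weights @ gene_emb


def pairwise_euclidean(points):
    """Return the (n, n) matrix of Euclidean distances between rows of points."""
    sq = np.sum(points ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (points @ points.T)
    # Clip tiny negative values caused by floating-point round-off.
    return np.sqrt(np.maximum(d2, 0.0))


# Toy data: 5 cells, 8 genes, 3-dimensional gene embeddings (all hypothetical).
rng = np.random.default_rng(0)
expr = rng.poisson(2.0, size=(5, 8)).astype(float)
gene_emb = rng.normal(size=(8, 3))

dist = pairwise_euclidean(cell_embeddings(expr, gene_emb))
```

The resulting `dist` matrix could then be passed to any distance-based clustering routine; how well such a naive combination separates cell types is exactly the kind of question the paper evaluates on real datasets.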