Li Yujie, Xu Su, Wang Xue, Ertekin-Taner Nilüfer, Chen Duan
Department of Mathematics and Statistics, University of North Carolina at Charlotte, USA.
School of Data Science, University of North Carolina at Charlotte, USA.
Math Biosci Eng. 2025 Mar 14;22(4):988-1018. doi: 10.3934/mbe.2025036.
Complete deconvolution of bulk RNA-seq data, which recovers both cell-type-specific gene expression profiles (GEPs) and relative cell abundances, is a challenging task. One of the fundamental models used, nonnegative matrix factorization (NMF), is mathematically ill-posed. Although several complete deconvolution methods have been developed, and their estimates appear promising when compared with ground truth on some datasets, a comprehensive understanding of how to circumvent the ill-posedness and improve solution accuracy is lacking. In this paper, we first investigated the requirements a given dataset must meet to satisfy the solvability conditions in NMF theory. Even when the solvability conditions hold, the "unique" solutions of NMF are determined only up to a rescaling matrix. Therefore, we provided estimates of the converged local minima and of the possible rescaling matrix, based on informative initial conditions. Using these strategies, we developed a new pipeline: a geometric structure guided NMF model augmented with pseudo-bulk tissue data (GSNMF+). In our approach, pseudo-bulk tissue data were generated from single-cell RNA-seq (scRNA-seq) data and pseudo cellular compositions simulated from statistical distributions, and then mixed with the original dataset. The constituent matrices of the resulting hybrid dataset then satisfy the weak solvability conditions of NMF. Furthermore, an estimated rescaling matrix was used to adjust the minimizer of the NMF, which was expected to reduce the root mean square errors of the solutions. Our algorithms were tested on several realistic bulk-tissue datasets and showed significant improvements in scenarios with singular cellular compositions.
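The rescaling ambiguity and the pseudo-bulk augmentation described in the abstract can be illustrated with a small NumPy sketch. It is not the GSNMF+ pipeline: the matrix sizes, the gamma-distributed profiles, the Dirichlet-distributed compositions, and the helper `make_pseudo_bulk` are all assumptions chosen for illustration, and the matrix `W` stands in for profiles that would in practice be derived from scRNA-seq data.

```python
import numpy as np

rng = np.random.default_rng(0)
n_genes, n_types, n_samples = 50, 3, 8  # illustrative sizes (assumed)

# Hypothetical cell-type-specific expression profiles (genes x cell types).
W = rng.gamma(shape=2.0, scale=1.0, size=(n_genes, n_types))

# Hypothetical cell proportions (cell types x samples); each column sums to 1.
H = rng.dirichlet(np.ones(n_types), size=n_samples).T

# Noise-free bulk expression matrix under the NMF model V = W H.
V = W @ H

# Rescaling ambiguity: for any positive diagonal D, (W D)(D^-1 H) = W H,
# so the data alone cannot distinguish (W, H) from (W_alt, H_alt).
D = np.diag([0.5, 2.0, 4.0])
W_alt = W @ D
H_alt = np.linalg.inv(D) @ H


def make_pseudo_bulk(profiles, n_pseudo, alpha, rng):
    """Mix reference profiles with simulated compositions (hypothetical helper).

    The Dirichlet distribution is an assumed choice of "statistical
    distribution"; small alpha values yield near-pure compositions.
    """
    props = rng.dirichlet(alpha, size=n_pseudo).T  # cell types x n_pseudo
    return profiles @ props, props


# Augment the original samples with pseudo-bulk samples (hybrid dataset).
V_pseudo, H_pseudo = make_pseudo_bulk(W, 5, 0.3 * np.ones(n_types), rng)
V_hybrid = np.hstack([V, V_pseudo])
```

In this sketch `W_alt @ H_alt` reproduces `V` exactly even though the columns of `H_alt` no longer sum to one, which is why an estimate of the rescaling matrix is needed to map an NMF minimizer back to interpretable proportions.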