Tiong Khong-Loon, Luzhbin Dmytro, Yeang Chen-Hsiang
Institute of Statistical Science, Academia Sinica, Taipei, Taiwan.
BMC Bioinformatics. 2024 Jun 12;25(1):209. doi: 10.1186/s12859-024-05825-3.
Single-cell RNA sequencing (sc-RNASeq) data illuminate transcriptomic heterogeneity but are noisy, contain abundant missing entries, and sometimes have inadequate cell type annotations or none at all. Bulk-level gene expression data lack direct information on cell population composition but are more robust, more complete, and often better annotated. We propose a modeling framework that integrates bulk-level and single-cell RNASeq data to address the deficiencies of each data type, leverage their complementary strengths, and enable a more comprehensive inference of transcriptomic heterogeneity. In contrast to the standard approaches of factorizing the bulk-level data with one algorithm and (for some methods) treating single-cell RNASeq data as references to decompose bulk-level data, we employed multiple deconvolution algorithms to factorize the bulk-level data, constructed probabilistic graphical models of cell-level gene expression from the decomposition outcomes, and compared the log-likelihood scores of these models on single-cell data. We term this framework backward deconvolution because inference proceeds from coarse-grained bulk-level data to fine-grained single-cell data. Because the abundant missing entries in sc-RNASeq data strongly affect log-likelihood scores, we also developed a criterion for including or excluding zero entries in the log-likelihood computation.
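The scoring step described above can be illustrated with a minimal sketch. It is not the paper's probabilistic graphical model: the independent Gaussian likelihood with a fixed `sigma`, the hard assignment of each cell to its best-matching cell type, and the simple `include_zeros` switch are all assumptions made for illustration, and the function names are hypothetical. The sketch only shows the general idea of ranking deconvolution outputs by how well their inferred signatures explain single-cell data.

```python
import numpy as np

def cell_log_likelihood(cells, signatures, sigma=1.0, include_zeros=True):
    """Score single cells against the signatures from one deconvolution run.

    cells:      (n_cells, n_genes) log-scale expression matrix
    signatures: (n_types, n_genes) log-scale cell type signatures
    Assumption: independent Gaussian noise with a fixed sigma per entry.
    """
    cells = np.asarray(cells, dtype=float)
    signatures = np.asarray(signatures, dtype=float)
    # Entries that contribute to the score; zeros can be left out.
    mask = np.ones_like(cells, dtype=bool) if include_zeros else cells != 0
    # Per-entry Gaussian log-density, shape (n_cells, n_types, n_genes).
    diff = cells[:, None, :] - signatures[None, :, :]
    log_pdf = -0.5 * (diff / sigma) ** 2 - np.log(sigma * np.sqrt(2 * np.pi))
    # Sum over the contributing genes only.
    per_type = np.where(mask[:, None, :], log_pdf, 0.0).sum(axis=2)
    # Hard assignment: each cell takes its most likely cell type.
    return float(per_type.max(axis=1).sum())

def rank_algorithms(cells, candidate_signatures, **kwargs):
    """Rank deconvolution outputs by log-likelihood, highest first."""
    scores = {
        name: cell_log_likelihood(cells, sig, **kwargs)
        for name, sig in candidate_signatures.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

Calling `rank_algorithms(cells, {"algo_a": sig_a, "algo_b": sig_b})` returns the candidates ordered from best to worst score. Because the zero mask depends only on the single-cell matrix, every candidate is scored on the same set of entries, which keeps the scores comparable.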
We selected nine deconvolution algorithms and validated backward deconvolution on five datasets. In in-silico mixtures of mouse sc-RNASeq data, the log-likelihood scores of the deconvolution algorithms were strongly anticorrelated with their errors in mixture coefficients and cell type-specific gene expression signatures. In true bulk-level mouse data, the sample mixture coefficients were unknown, but the log-likelihood scores were strongly correlated with the accuracy rates of inferred cell types. In data from autism spectrum disorder (ASD) and normal controls, we found that ASD brains possessed higher fractions of astrocytes and lower fractions of NRGN-expressing neurons than normal controls. In datasets of breast cancer and low-grade gliomas (LGG), we compared the log-likelihood scores of three simple hypotheses about the gene expression patterns of the cell types underlying the tumor subtypes. The model in which tumors of each subtype were dominated by one cell type consistently outperformed an alternative model in which each cell type had elevated expression in one gene group and tumors were mixtures of those cell types. The superiority of the former model is also supported by a comparison of the real breast cancer sc-RNASeq clusters with clusters generated from simulated sc-RNASeq data.
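The in-silico validation relies on mixtures whose coefficients are known, so that a deconvolution algorithm's coefficient error can be measured directly. A minimal sketch of that setup follows. The Dirichlet sampling of coefficients, the noise-free mixing, and the clipped least-squares deconvolver are illustrative assumptions; the least-squares step is a stand-in and is not claimed to be one of the nine algorithms evaluated in the study.

```python
import numpy as np

def simulate_mixtures(signatures, n_samples, rng):
    """Build bulk samples as convex combinations of cell type signatures.

    signatures: (n_types, n_genes) matrix of cell type expression profiles
    Assumption: mixture coefficients are drawn from a flat Dirichlet.
    """
    signatures = np.asarray(signatures, dtype=float)
    coeffs = rng.dirichlet(np.ones(signatures.shape[0]), size=n_samples)
    bulk = coeffs @ signatures  # shape (n_samples, n_genes)
    return bulk, coeffs

def naive_deconvolve(bulk, signatures):
    """Stand-in deconvolver: least squares, clipped at zero, renormalised."""
    signatures = np.asarray(signatures, dtype=float)
    est = np.linalg.lstsq(signatures.T, np.asarray(bulk, dtype=float).T, rcond=None)[0].T
    est = np.clip(est, 0.0, None)
    totals = est.sum(axis=1, keepdims=True)
    # Avoid division by zero for samples with no positive coefficient.
    return est / np.where(totals > 0, totals, 1.0)

def coefficient_rmse(estimated, truth):
    """Root-mean-square error between estimated and true coefficients."""
    return float(np.sqrt(np.mean((np.asarray(estimated) - np.asarray(truth)) ** 2)))
```

With several candidate deconvolvers, one could compute `coefficient_rmse` for each and compare it against the corresponding log-likelihood score on single-cell data, for example with `np.corrcoef`, to check whether higher scores go with lower errors.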
The results indicate that backward deconvolution serves as a sensible model selection tool for deconvolution algorithms and facilitates discerning hypotheses about cell type compositions underlying heterogeneous specimens such as tumors.