Assessing transcriptomic heterogeneity of single-cell RNASeq data by bulk-level gene expression data.

Tiong Khong-Loon, Luzhbin Dmytro, Yeang Chen-Hsiang

Institute of Statistical Science, Academia Sinica, Taipei, Taiwan.

BMC Bioinformatics. 2024 Jun 12;25(1):209. doi: 10.1186/s12859-024-05825-3.

BACKGROUND

Single-cell RNA sequencing (sc-RNASeq) data illuminate transcriptomic heterogeneity but also carry a high level of noise, abundant missing entries, and sometimes inadequate cell type annotations or none at all. Bulk-level gene expression data lack direct information about cell population composition but are more robust and complete, and are often better annotated. We propose a modeling framework that integrates bulk-level and single-cell RNASeq data to address the deficiencies of each data type, leverage their complementary strengths, and enable a more comprehensive inference of transcriptomic heterogeneity. In contrast to the standard approaches of factorizing the bulk-level data with one algorithm and (for some methods) treating single-cell RNASeq data as references to decompose bulk-level data, we employed multiple deconvolution algorithms to factorize the bulk-level data, constructed probabilistic graphical models of cell-level gene expression from the decomposition outcomes, and compared the log-likelihood scores of these models in single-cell data. We term this framework backward deconvolution because inference operates from coarse-grained bulk-level data to fine-grained single-cell data. As the abundant missing entries in sc-RNASeq data have a significant effect on log-likelihood scores, we also developed a criterion for including or excluding zero entries in the log-likelihood score computation.
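The model-comparison step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the Gaussian mixture used here is a simplified stand-in for the probabilistic graphical models mentioned in the abstract, and the function name `mixture_loglik`, the labels `algorithm_A` and `algorithm_B`, the fixed noise scale `sigma`, and the simulated data are all assumptions made for illustration. The `skip_zeros` flag only shows where a rule for including or excluding zero entries would plug in; the abstract does not specify the actual criterion.

```python
import numpy as np

def mixture_loglik(cells, signatures, weights, sigma=1.0, skip_zeros=False):
    """Mean per-cell log-likelihood under a Gaussian mixture whose component
    means are candidate cell type signatures (a simplified stand-in model).

    cells:      (n_cells, n_genes) expression matrix
    signatures: (n_types, n_genes) signatures proposed by one algorithm
    weights:    (n_types,) prior cell type fractions summing to 1
    skip_zeros: if True, zero entries are treated as missing and ignored
    """
    mask = (cells != 0) if skip_zeros else np.ones_like(cells, dtype=bool)
    # Residual of every cell against every signature: (n_cells, n_types, n_genes)
    resid = cells[:, None, :] - signatures[None, :, :]
    log_pdf = -0.5 * (resid / sigma) ** 2 - np.log(sigma * np.sqrt(2 * np.pi))
    # Sum over the retained genes, then add the log prior of each cell type
    per_type = (log_pdf * mask[:, None, :]).sum(axis=2) + np.log(weights)[None, :]
    # Numerically stable log-sum-exp over cell types
    top = per_type.max(axis=1, keepdims=True)
    per_cell = top[:, 0] + np.log(np.exp(per_type - top).sum(axis=1))
    return float(per_cell.mean())

# Simulated single-cell data (hypothetical, for illustration only)
rng = np.random.default_rng(0)
n_types, n_genes, n_cells = 3, 50, 300
true_sig = rng.normal(5.0, 2.0, size=(n_types, n_genes))
labels = rng.integers(0, n_types, size=n_cells)
cells = true_sig[labels] + rng.normal(0.0, 1.0, size=(n_cells, n_genes))
cells[rng.random(cells.shape) < 0.3] = 0.0  # simulated missing entries

# Signatures as two hypothetical deconvolution algorithms might return them
candidates = {
    "algorithm_A": true_sig + rng.normal(0.0, 0.1, size=true_sig.shape),
    "algorithm_B": true_sig + rng.normal(0.0, 2.0, size=true_sig.shape),
}
weights = np.full(n_types, 1.0 / n_types)

scores_skip = {k: mixture_loglik(cells, s, weights, skip_zeros=True)
               for k, s in candidates.items()}
scores_keep = {k: mixture_loglik(cells, s, weights, skip_zeros=False)
               for k, s in candidates.items()}
best = max(scores_skip, key=scores_skip.get)
print(best, scores_skip, scores_keep)
```

In this toy setting the candidate whose signatures lie closer to the simulated truth receives the higher score, which is the sense in which log-likelihood in single-cell data can rank decompositions of bulk data. Scores computed with and without zero entries are not directly comparable to each other, so the same choice has to be applied to every candidate being ranked.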

RESULTS

We selected nine deconvolution algorithms and validated backward deconvolution in five datasets. In in-silico mixtures of mouse sc-RNASeq data, the log-likelihood scores of the deconvolution algorithms were strongly anticorrelated with their errors in mixture coefficients and in cell type specific gene expression signatures. In true bulk-level mouse data, the sample mixture coefficients were unknown, but the log-likelihood scores were strongly correlated with the accuracy rates of inferred cell types. In data from autism spectrum disorder (ASD) and normal controls, we found that ASD brains possessed higher fractions of astrocytes and lower fractions of NRGN-expressing neurons than normal controls. In datasets of breast cancer and low-grade gliomas (LGG), we compared the log-likelihood scores of three simple hypotheses about the gene expression patterns of the cell types underlying the tumor subtypes. The model in which tumors of each subtype were dominated by one cell type consistently outperformed an alternative model in which each cell type had elevated expression in one gene group and tumors were mixtures of those cell types. The superiority of the former model is also supported by a comparison of the real breast cancer sc-RNASeq clusters with clusters generated from simulated sc-RNASeq data.
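The in-silico mixture validation can be sketched as follows: because the mixture coefficients are chosen by the experimenter, the error of any estimate is directly measurable. This is a hedged illustration only. The simulated signatures, the Dirichlet-sampled coefficients, and the clipped least-squares estimator `estimate_coefficients` are assumptions standing in for real sc-RNASeq profiles and for the nine deconvolution algorithms, none of which are named in the abstract.

```python
import numpy as np

rng = np.random.default_rng(1)
n_genes, n_types, n_samples = 200, 4, 30

# Hypothetical non-negative cell type signatures (genes x types)
signatures = rng.gamma(shape=2.0, scale=3.0, size=(n_genes, n_types))
# Known mixture coefficients (types x samples); each column sums to 1
true_coef = rng.dirichlet(np.ones(n_types), size=n_samples).T
# In-silico bulk samples: signature-weighted sums plus small noise
bulk = signatures @ true_coef + rng.normal(0.0, 0.1, size=(n_genes, n_samples))

def estimate_coefficients(bulk, signatures):
    """Least-squares estimate, clipped at zero and renormalized per sample.
    A simple stand-in for a reference-based deconvolution algorithm."""
    coef, *_ = np.linalg.lstsq(signatures, bulk, rcond=None)
    coef = np.clip(coef, 0.0, None)
    return coef / coef.sum(axis=0, keepdims=True)

def rmse(a, b):
    """Root-mean-square error between two coefficient matrices."""
    return float(np.sqrt(np.mean((a - b) ** 2)))

# Estimate with accurate signatures and with deliberately corrupted ones
est_good = estimate_coefficients(bulk, signatures)
noisy_sig = np.clip(signatures + rng.normal(0.0, 4.0, size=signatures.shape), 0.0, None)
est_bad = estimate_coefficients(bulk, noisy_sig)

err_good = rmse(est_good, true_coef)
err_bad = rmse(est_bad, true_coef)
print(err_good, err_bad)
```

Errors computed this way for each candidate algorithm are the quantities against which log-likelihood scores would be correlated in such a validation.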

CONCLUSIONS

The results indicate that backward deconvolution serves as a sensible model selection tool for deconvolution algorithms and helps discriminate among hypotheses about the cell type compositions underlying heterogeneous specimens such as tumors.
