Biostatistics and Computational Biology Branch, National Institute of Environmental Health Sciences, Research Triangle Park, Durham, NC, 27709, USA.
BMC Bioinformatics. 2021 May 24;22(1):262. doi: 10.1186/s12859-021-04186-5.
Biological tissues consist of heterogeneous populations of cells. Because gene expression patterns from bulk tissue samples reflect the contributions of all cells in the tissue, understanding the contribution of individual cell types to overall tissue gene expression is fundamentally important. We recently developed a computational method, CDSeq, that simultaneously estimates both sample-specific cell-type proportions and cell-type-specific gene expression profiles using only bulk RNA-Seq counts from multiple samples. Here we present an R implementation of CDSeq (CDSeqR) with significantly improved performance over the original MATLAB implementation and a new function to aid cell type annotation. The R package should be of interest to the broader R community.
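The idea that a bulk sample reflects contributions from all of its cells can be illustrated with a toy forward model. The sketch below assumes, purely for illustration, that bulk expression is a proportion-weighted sum of cell-type-specific profiles; it is not the CDSeq algorithm and does not use the CDSeqR API. The function name `mix_bulk` and all numbers are hypothetical.

```python
# Illustrative forward model only; this is NOT the CDSeq algorithm or the CDSeqR API.
# Assumption: bulk expression is a proportion-weighted sum of cell-type profiles.

def mix_bulk(profiles, proportions):
    """Return bulk expression per gene for one sample.

    profiles: dict mapping cell type -> list of per-gene expression values
    proportions: dict mapping cell type -> fraction of the sample (must sum to 1)
    """
    if abs(sum(proportions.values()) - 1.0) > 1e-9:
        raise ValueError("cell-type proportions must sum to 1")
    n_genes = len(next(iter(profiles.values())))
    bulk = [0.0] * n_genes
    for cell_type, fraction in proportions.items():
        for g, value in enumerate(profiles[cell_type]):
            bulk[g] += fraction * value
    return bulk


# Hypothetical toy data: two cell types, three genes.
profiles = {
    "type_A": [100.0, 10.0, 0.0],
    "type_B": [0.0, 10.0, 50.0],
}
sample_1 = mix_bulk(profiles, {"type_A": 0.75, "type_B": 0.25})
sample_2 = mix_bulk(profiles, {"type_A": 0.25, "type_B": 0.75})
print(sample_1)
print(sample_2)
```

Deconvolution runs in the opposite direction: given only bulk values such as `sample_1` and `sample_2` across many samples, the goal is to recover both `profiles` and the per-sample proportions.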
We developed a novel strategy that substantially improves computational efficiency in both speed and memory usage. We also designed and implemented a new function for annotating CDSeq-estimated cell types using single-cell RNA sequencing (scRNA-seq) data, which allows users to readily interpret and visualize the estimated cell types. The same function also allows users to annotate CDSeq-estimated cell types using marker genes. We carried out additional validations of the CDSeqR software using synthetic mixtures, real cell mixtures, and real bulk RNA-seq data from The Cancer Genome Atlas (TCGA) and the Genotype-Tissue Expression (GTEx) project.
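Marker-gene annotation can be sketched in a few lines. The example below is a hypothetical illustration and not the annotation function shipped with CDSeqR: it assumes each estimated cell type is given the candidate label whose marker genes have the highest mean expression in that type's estimated profile. The function name `annotate_by_markers` and the expression values are invented for the example.

```python
# Hypothetical illustration; not the annotation function shipped with CDSeqR.
# Assumption: a label is chosen by the highest mean expression of its marker genes.

def annotate_by_markers(estimated_profiles, marker_sets):
    """Assign each estimated cell type the label whose markers score highest.

    estimated_profiles: dict of estimated cell type -> {gene: expression}
    marker_sets: dict of candidate label -> list of marker gene names
    """
    labels = {}
    for cell_type, expression in estimated_profiles.items():
        # Mean marker expression per candidate label; missing genes count as 0.
        scores = {
            label: sum(expression.get(gene, 0.0) for gene in genes) / len(genes)
            for label, genes in marker_sets.items()
        }
        labels[cell_type] = max(scores, key=scores.get)
    return labels


# Toy inputs: marker lists use well-known T-cell and B-cell genes,
# while the expression values are made up.
marker_sets = {
    "T cell": ["CD3E", "CD3D"],
    "B cell": ["CD19", "MS4A1"],
}
estimated_profiles = {
    "estimated_1": {"CD3E": 40.0, "CD3D": 35.0, "CD19": 1.0, "MS4A1": 0.5},
    "estimated_2": {"CD3E": 2.0, "CD3D": 1.0, "CD19": 30.0, "MS4A1": 45.0},
}
labels = annotate_by_markers(estimated_profiles, marker_sets)
print(labels)
```

A mean-expression score is the simplest possible choice; it ignores differences in baseline expression between genes, so a real annotation would likely need normalization or a reference such as scRNA-seq data.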
Existing bulk RNA-seq repositories, such as TCGA and GTEx, provide enormous resources for better understanding changes in transcriptomics and human diseases. They are also potentially useful for studying cell-cell interactions in the tissue microenvironment. Bulk-level analyses, however, neglect tissue heterogeneity and hinder investigation of cell-type-specific expression. The CDSeqR package may aid in silico dissection of bulk expression data, enabling researchers to recover cell-type-specific information.