Department of Molecular Cellular and Developmental Biology, University of California Los Angeles, Los Angeles, CA, USA.
Bioinformatics Interdepartmental Degree Program, University of California Los Angeles, Los Angeles, CA, USA.
Brief Bioinform. 2021 Nov 5;22(6). doi: 10.1093/bib/bbab265.
Estimating the cell type composition of blood and tissue samples is a biological challenge relevant to both laboratory studies and clinical care. In recent years, a number of computational tools have been developed to estimate cell type abundance from gene expression data. Although these tools use a variety of approaches, they all leverage expression profiles from purified cell types to evaluate the cell type composition within samples. In this study, we compare 12 cell type quantification tools and evaluate their performance with each of 10 separate reference profiles. Specifically, we have run each tool on over 4000 samples with known cell type proportions, spanning both immune and stromal cell types. Of these, 12 are in vitro synthetic mixtures and 300 are in silico synthetic mixtures prepared using single-cell data. The remaining 3728 are clinical samples collected from the Framingham cohort, for which cell populations have been quantified using electrical impedance cell counting. When tools are applied to the Framingham dataset, the tool Estimating the Proportions of Immune and Cancer cells (EPIC) produces the highest correlation, whereas the Gene Expression Deconvolution Interactive Tool (GEDIT) produces the lowest error. The best tool varies across the other datasets, but CIBERSORT and GEDIT most consistently produce accurate results. We find that the optimal reference depends on the tool used, and we report suggested references for each tool. Most tools return results within minutes, but on large datasets CIBERSORT runtimes can stretch to hours or even days. We conclude that deconvolution methods are capable of returning high-quality results, but that proper reference selection is critical.
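The abstract ranks tools by two different criteria, correlation and error, and these can favor different tools. The following Python sketch illustrates why, using Pearson correlation and mean absolute error as stand-ins (the abstract does not specify which correlation or error measures were used). All numbers are hypothetical toy values for one cell type across four samples, not results from the study, and `tool_a` and `tool_b` do not correspond to any real tool.

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def mean_abs_error(xs, ys):
    """Mean absolute difference between predicted and known fractions."""
    return sum(abs(x - y) for x, y in zip(xs, ys)) / len(xs)

# Hypothetical known fractions of one cell type in four samples.
known = [0.10, 0.20, 0.30, 0.40]

# Hypothetical tool A: tracks the trend exactly but is systematically biased.
tool_a = [0.5 * k + 0.3 for k in known]

# Hypothetical tool B: close to the known values, with small scatter.
tool_b = [0.12, 0.17, 0.33, 0.38]

r_a, r_b = pearson(known, tool_a), pearson(known, tool_b)
mae_a, mae_b = mean_abs_error(known, tool_a), mean_abs_error(known, tool_b)

print(f"tool A: r={r_a:.3f}, MAE={mae_a:.3f}")
print(f"tool B: r={r_b:.3f}, MAE={mae_b:.3f}")
```

In this toy case tool A has the higher correlation while tool B has the lower error, which is the kind of split that can arise when one method captures relative differences between samples and another is better calibrated in absolute terms.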