Gerniers Alexander, Bricard Orian, Dupont Pierre
ICTEAM/INGI/Artificial Intelligence and Algorithms Group, UCLouvain, Louvain-la-Neuve 1348, Belgium.
de Duve Institute, UCLouvain, Brussels 1200, Belgium.
Bioinformatics. 2021 Oct 11;37(19):3220-3227. doi: 10.1093/bioinformatics/btab239.
Identifying rare subpopulations of cells is a critical step in extracting knowledge from single-cell expression data, especially when the available data are limited and rare subpopulations contain only a few cells. In this paper, we present a data mining method to identify small subpopulations of cells that present highly specific expression profiles. This objective is formalized as a constrained optimization problem that jointly identifies a small group of cells and a corresponding subset of specific genes. The proposed method extends the max-sum submatrix problem to yield genes that are, for instance, highly expressed inside a small number of cells but have low expression in the remaining ones.
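To make the idea of jointly selecting cells and genes concrete, here is a minimal sketch of a max-sum submatrix search extended with a penalty for expression outside the selected cells. It is an illustration under stated assumptions, not the MicroCellClust objective or algorithm: the scoring function, the `kappa` weight, the brute-force enumeration, and the toy matrix are all hypothetical. Rows are assumed to be genes, columns cells, and values centered so that negative entries mean low expression.

```python
from itertools import combinations


def objective(matrix, cells, genes, kappa=1.0):
    """Illustrative score (an assumption, not the published objective):
    sum of the selected genes' values inside the selected cells, minus
    kappa times their positive expression in all other cells."""
    cell_set = set(cells)
    n_cells = len(matrix[0])
    inside = sum(matrix[g][c] for g in genes for c in cells)
    outside = sum(
        max(0.0, matrix[g][c])
        for g in genes
        for c in range(n_cells)
        if c not in cell_set
    )
    return inside - kappa * outside


def best_genes(matrix, cells, kappa=1.0):
    """For a fixed set of cells each gene contributes independently,
    so keep exactly the genes whose own contribution is positive."""
    return [
        g for g in range(len(matrix))
        if objective(matrix, cells, [g], kappa) > 0
    ]


def brute_force(matrix, max_cells, kappa=1.0):
    """Enumerate every cell subset of size <= max_cells (toy sizes only)."""
    best = (0.0, (), [])
    for k in range(1, max_cells + 1):
        for cells in combinations(range(len(matrix[0])), k):
            genes = best_genes(matrix, cells, kappa)
            score = objective(matrix, cells, genes, kappa)
            if score > best[0]:
                best = (score, cells, genes)
    return best


# Hypothetical toy data: genes 0 and 1 are specific to cells 0 and 1,
# gene 2 is noise, gene 3 is expressed everywhere.
toy = [
    [3, 4, -2, -2, -2],
    [4, 3, -1, -2, -2],
    [-1, -2, 1, -1, 2],
    [2, 2, 2, 2, 2],
]
score, cells, genes = brute_force(toy, max_cells=2)
print(score, cells, genes)
```

On this toy input the search selects cells 0 and 1 with genes 0 and 1. Gene 3 is rejected even though it is expressed in the selected cells, because its expression elsewhere is penalized. Exhaustive enumeration is only practical for tiny matrices; it stands in here for whatever search strategy a real implementation uses.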
We show through controlled experiments on scRNA-seq data that the MicroCellClust method achieves a high F1 score in identifying rare subpopulations of artificially planted human T cells. The effectiveness of MicroCellClust is confirmed on breast cancer samples, where it reveals a subpopulation of CD4 T cells with a specific phenotype as well as a subpopulation linked to a specific stage of the cell cycle. Finally, three rare subpopulations in mouse embryonic stem cells are also identified with MicroCellClust. These results illustrate that the proposed method outperforms typical alternatives at identifying small subsets of cells with highly specific expression profiles.
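For reference, the F1 score of a recovered subpopulation against a planted one is the harmonic mean of precision and recall over cell identities. The sketch below shows that arithmetic only; the cell indices are hypothetical, and the abstract does not describe the exact evaluation protocol.

```python
def f1_score(predicted_cells, planted_cells):
    """F1 between the set of cells returned by a method and the set of
    cells that were artificially planted (the ground truth)."""
    predicted, planted = set(predicted_cells), set(planted_cells)
    true_positives = len(predicted & planted)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(planted)
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: 4 planted cells, 4 predicted, 3 in common.
planted = [0, 1, 2, 3]
predicted = [0, 1, 2, 7]
f1 = f1_score(predicted, planted)
print(f1)
```

Here precision and recall are both 3/4, so the F1 score is 0.75.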
The R and Scala implementation of MicroCellClust is freely available on GitHub at https://github.com/agerniers/MicroCellClust/. The data underlying this article are available on Zenodo at https://dx.doi.org/10.5281/zenodo.4580332.
Supplementary data are available at Bioinformatics online.