Radley Arthur, Boeing Stefan, Smith Austin
Living Systems Institute, University of Exeter, Stocker Road, Exeter EX4 4QD, UK.
Bioinformatics and Biostatistics Science Technology Platform, The Francis Crick Institute, London NW1 1AT, UK.
Development. 2024 Jun 1;151(11). doi: 10.1242/dev.202832. Epub 2024 Jun 13.
Analysis of single-cell transcriptomics (scRNA-seq) data is typically performed after subsetting to highly variable genes (HVGs). Here, we show that Entropy Sorting provides an alternative mathematical framework for feature selection. On synthetic datasets, continuous Entropy Sort Feature Weighting (cESFW) outperforms HVG selection in distinguishing cell-state-specific genes. We apply cESFW to six merged scRNA-seq datasets spanning human early embryo development. Without smoothing or augmenting the raw count matrices, cESFW generates a high-resolution embedding displaying coherent developmental progression from eight-cell to post-implantation stages and delineating 15 distinct cell states. The embedding highlights sequential lineage decisions during blastocyst development, while unsupervised clustering identifies branch point populations obscured in previous analyses. The first branching region, where morula cells become specified for inner cell mass or trophectoderm, includes cells previously asserted to lack a developmental trajectory. We quantify the relatedness of different pluripotent stem cell cultures to distinct embryo cell types and identify marker genes of naïve and primed pluripotency. Finally, by revealing genes with dynamic lineage-specific expression, we provide markers for staging progression from morula to blastocyst.