Hoekzema Renee S, Marsh Lewis, Sumray Otto, Carroll Thomas M, Lu Xin, Byrne Helen M, Harrington Heather A
Mathematical Institute, University of Oxford, Oxford OX1 2JD, UK.
Department of Mathematics, Free University of Amsterdam, 1081 HV Amsterdam, The Netherlands.
Entropy (Basel). 2022 Aug 13;24(8):1116. doi: 10.3390/e24081116.
Analysis of single-cell transcriptomics often relies on clustering cells and then performing differential gene expression (DGE) analysis to identify genes that vary between these clusters. These discrete analyses successfully determine cell types and markers; however, continuous variation within and between cell types may go undetected. We propose three topologically motivated mathematical methods for unsupervised feature selection that consider discrete and continuous transcriptional patterns on an equal footing across multiple scales simultaneously. Eigenscores (eig_i) rank signals or genes based on their correspondence to low-frequency intrinsic patterning in the data, using the spectral decomposition of the graph Laplacian. The multiscale Laplacian score (MLS) is an unsupervised method for locating relevant scales in data and selecting the genes that are coherently expressed at these respective scales. The persistent Rayleigh quotient (PRQ) takes data equipped with a filtration (e.g., pseudo-time), allowing the separation of genes with different roles in a bifurcation process. We demonstrate the utility of these techniques by applying them to published single-cell transcriptomics data sets. The methods validate previously identified genes and detect additional biologically meaningful genes with coherent expression patterns. By studying the interaction between gene signals and the geometry of the underlying space, the three methods give multidimensional rankings of the genes and visualisations of the relationships between them.
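To make the idea of ranking genes by their agreement with low-frequency structure concrete, the following is a minimal sketch and not the authors' implementation. It assumes an unnormalised Laplacian of a symmetrised k-nearest-neighbour graph on the cells, a plain Rayleigh quotient f^T L f / f^T f on mean-centred signals (the published Laplacian score and eigenscores may use different normalisations), and toy data in which cells lie along a one-dimensional trajectory. The function names `knn_graph_laplacian`, `rayleigh_quotients` and `low_frequency_energy` are hypothetical.

```python
import numpy as np

def knn_graph_laplacian(X, k=5):
    """Unnormalised Laplacian L = D - A of a symmetrised kNN graph on the rows of X."""
    n = X.shape[0]
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T  # pairwise squared distances
    np.fill_diagonal(d2, np.inf)                    # a cell is not its own neighbour
    nbrs = np.argsort(d2, axis=1)[:, :k]            # indices of the k nearest cells
    A = np.zeros((n, n))
    A[np.repeat(np.arange(n), k), nbrs.ravel()] = 1.0
    A = np.maximum(A, A.T)                          # symmetrise the adjacency matrix
    return np.diag(A.sum(axis=1)) - A

def rayleigh_quotients(L, F):
    """Rayleigh quotient of each (non-constant) column of F; small = smooth on the graph."""
    Fc = F - F.mean(axis=0, keepdims=True)
    num = np.einsum("ij,ij->j", Fc, L @ Fc)
    den = np.einsum("ij,ij->j", Fc, Fc)
    return num / den

def low_frequency_energy(L, F, m=5):
    """Fraction of each centred signal's energy carried by the m lowest-frequency eigenvectors."""
    _, vecs = np.linalg.eigh(L)                     # eigenvalues returned in ascending order
    Fc = F - F.mean(axis=0, keepdims=True)
    Fc = Fc / np.linalg.norm(Fc, axis=0, keepdims=True)
    coeffs = vecs[:, :m].T @ Fc
    return np.sum(coeffs**2, axis=0)

# Toy data (an assumption for illustration): 200 cells ordered along a trajectory,
# one gene varying smoothly along it and one gene that is pure noise.
rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 200)
X = t[:, None]
F = np.column_stack([t, rng.standard_normal(200)])

L = knn_graph_laplacian(X, k=5)
rq = rayleigh_quotients(L, F)
energy = low_frequency_energy(L, F, m=5)
```

In this toy setting the smoothly varying gene receives the smaller Rayleigh quotient and the larger low-frequency energy, so sorting genes by either quantity places it ahead of the noise gene. Evaluating the same quantities on graphs built at several neighbourhood sizes would be one simple way to explore the multiscale aspect described above.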