Albergante Luca, Mirkes Evgeny, Bac Jonathan, Chen Huidong, Martin Alexis, Faure Louis, Barillot Emmanuel, Pinello Luca, Gorban Alexander, Zinovyev Andrei
Institut Curie, PSL Research University, 75005 Paris, France.
INSERM U900, 75248 Paris, France.
Entropy (Basel). 2020 Mar 4;22(3):296. doi: 10.3390/e22030296.
Multidimensional data point clouds representing large datasets are frequently characterized by non-trivial low-dimensional geometry and topology, which can be recovered by unsupervised machine learning approaches, in particular by principal graphs. Principal graphs approximate multivariate data by a graph injected into the data space, with some constraints imposed on the node mapping. Here we present ElPiGraph, a scalable and robust method for constructing principal graphs. ElPiGraph exploits and further develops the concept of elastic energy, the topological graph grammar approach, and a gradient descent-like optimization of the graph topology. The method is able to withstand high levels of noise and is capable of approximating data point clouds via principal graph ensembles. This strategy can be used to estimate the statistical significance of complex data features and to summarize them into a single consensus principal graph. ElPiGraph deals efficiently with large datasets in various fields such as biology, where it can be used, for example, with single-cell transcriptomic or epigenomic datasets to infer gene expression dynamics and recover differentiation landscapes.
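To make the notion of elastic energy more concrete, the following is a minimal sketch of how such an energy could be evaluated for a graph placed in the data space. It is not the ElPiGraph implementation: the abstract does not give the functional form, so the three terms used here (mean squared distance from each point to its nearest node, a stretching penalty on edges, and a bending penalty on nodes with two or more neighbours) are an assumed, simplified form. The function name `elastic_energy` and the weights `lam` and `mu` are hypothetical.

```python
import numpy as np


def elastic_energy(X, nodes, edges, lam=0.01, mu=0.1):
    """Simplified elastic energy of a graph placed in data space.

    Assumed form (illustrative only, not the ElPiGraph source code):
    approximation error + lam * stretching + mu * bending.
    """
    X = np.asarray(X, dtype=float)
    nodes = np.asarray(nodes, dtype=float)

    # Approximation term: mean squared distance from each data point
    # to its nearest graph node.
    d2 = ((X[:, None, :] - nodes[None, :, :]) ** 2).sum(axis=2)
    approx = d2.min(axis=1).mean()

    # Stretching term: sum of squared edge lengths.
    stretch = sum(((nodes[i] - nodes[j]) ** 2).sum() for i, j in edges)

    # Bending term: for every node with at least two neighbours,
    # squared distance between the node and the mean of its neighbours.
    neighbours = {}
    for i, j in edges:
        neighbours.setdefault(i, []).append(j)
        neighbours.setdefault(j, []).append(i)
    bend = 0.0
    for centre, nb in neighbours.items():
        if len(nb) >= 2:
            bend += ((nodes[centre] - nodes[nb].mean(axis=0)) ** 2).sum()

    return approx + lam * stretch + mu * bend


# Toy data: five points on a straight segment in 2D.
X = np.column_stack([np.linspace(0.0, 1.0, 5), np.zeros(5)])
edges = [(0, 1), (1, 2)]

# A straight three-node chain and a bent one with the same end points.
straight = [[0.0, 0.0], [0.5, 0.0], [1.0, 0.0]]
bent = [[0.0, 0.0], [0.5, 0.5], [1.0, 0.0]]

e_straight = elastic_energy(X, straight, edges)
e_bent = elastic_energy(X, bent, edges)
```

On this toy data the straight chain has lower energy than the bent one, because it lies closer to the points and is penalized by neither extra stretching nor bending. Under these assumptions, minimizing such an energy over node positions favours graphs that follow the data while staying short and smooth.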