Tong Alexander, Huguet Guillaume, Natik Amine, MacDonald Kincaid, Kuchroo Manik, Coifman Ronald, Wolf Guy, Krishnaswamy Smita
ArXiv. 2021 Feb 25:arXiv:2102.12833v2.
We propose a new fast method of measuring distances between large numbers of related high dimensional datasets called the Diffusion Earth Mover's Distance (EMD). We model the datasets as distributions supported on common data graph that is derived from the affinity matrix computed on the combined data. In such cases where the graph is a discretization of an underlying Riemannian closed manifold, we prove that Diffusion EMD is topologically equivalent to the standard EMD with a geodesic ground distance. Diffusion EMD can be computed in $\tilde{O}(n)$ time and is more accurate than similarly fast algorithms such as tree-based EMDs. We also show Diffusion EMD is fully differentiable, making it amenable to future uses in gradient-descent frameworks such as deep neural networks. Finally, we demonstrate an application of Diffusion EMD to single cell data collected from 210 COVID-19 patient samples at Yale New Haven Hospital. Here, Diffusion EMD can derive distances between patients on the manifold of cells at least two orders of magnitude faster than equally accurate methods. This distance matrix between patients can be embedded into a higher level patient manifold which uncovers structure and heterogeneity in patients. More generally, Diffusion EMD is applicable to all datasets that are massively collected in parallel in many medical and biological systems.
我们提出了一种新的快速方法,用于测量大量相关高维数据集之间的距离,称为扩散地球移动距离(EMD)。我们将数据集建模为支持在公共数据图上的分布,该公共数据图源自对组合数据计算的亲和矩阵。在图是底层黎曼闭合流形的离散化的情况下,我们证明扩散EMD在拓扑上等同于具有测地地面距离的标准EMD。扩散EMD可以在$\tilde{O}(n)$时间内计算,并且比类似的快速算法(如基于树的EMD)更准确。我们还表明扩散EMD是完全可微的,使其适用于未来在梯度下降框架(如深度神经网络)中的应用。最后,我们展示了扩散EMD在从耶鲁纽黑文医院收集的210个COVID-19患者样本的单细胞数据中的应用。在这里,扩散EMD可以在细胞流形上得出患者之间的距离,比同等准确的方法至少快两个数量级。患者之间的这个距离矩阵可以嵌入到更高层次的患者流形中,从而揭示患者的结构和异质性。更一般地说,扩散EMD适用于在许多医学和生物系统中并行大量收集的所有数据集。