Landa Boris, Coifman Ronald R, Kluger Yuval
Program in Applied Mathematics, Yale University.
Interdepartmental Program in Computational Biology and Bioinformatics, Yale University.
SIAM J Math Data Sci. 2021;3(1):388-413. doi: 10.1137/20M1342124. Epub 2021 Mar 23.
A fundamental step in many data-analysis techniques is the construction of an affinity matrix describing similarities between data points. When the data points reside in Euclidean space, a widespread approach is to form an affinity matrix by applying the Gaussian kernel to pairwise distances, and to follow with a certain normalization (e.g., the row-stochastic normalization or its symmetric variant). We demonstrate that the doubly-stochastic normalization of the Gaussian kernel with zero main diagonal (i.e., no self loops) is robust to heteroskedastic noise. That is, the doubly-stochastic normalization is advantageous in that it automatically accounts for observations with different noise variances. Specifically, we prove that in a suitable high-dimensional setting where heteroskedastic noise does not concentrate too much in any particular direction in space, the resulting (doubly-stochastic) noisy affinity matrix converges to its clean counterpart with rate m^{-1/2}, where m is the ambient dimension. We demonstrate this result numerically, and show that, in contrast, the popular row-stochastic and symmetric normalizations behave unfavorably under heteroskedastic noise. Furthermore, we provide examples of simulated and experimental single-cell RNA sequencing data with intrinsic heteroskedasticity, where the advantage of the doubly-stochastic normalization for exploratory analysis is evident.
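The construction described in the abstract can be sketched in a few lines of NumPy: form the Gaussian kernel from pairwise squared distances, zero the main diagonal (no self loops), and then find a positive scaling vector d such that diag(d) K diag(d) is doubly stochastic, via a symmetric Sinkhorn-style fixed-point iteration. This is a minimal illustrative sketch, not the authors' implementation; the function name, the bandwidth parameter `eps`, and the damped square-root update rule are choices made here for illustration.

```python
import numpy as np

def doubly_stochastic_affinity(X, eps, n_iter=1000, tol=1e-12):
    """Doubly-stochastic normalization of a zero-diagonal Gaussian kernel.

    X   : (n, m) array of n points in m-dimensional Euclidean space.
    eps : Gaussian kernel bandwidth (illustrative choice; must be tuned).
    """
    # Pairwise squared Euclidean distances.
    sq = np.sum(X**2, axis=1)
    D2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)

    # Gaussian kernel with zero main diagonal (no self loops).
    K = np.exp(-D2 / eps)
    np.fill_diagonal(K, 0.0)

    # Symmetric Sinkhorn scaling: seek d > 0 with d_i * (K d)_i = 1 for all i,
    # so that W = diag(d) K diag(d) has unit row and column sums.
    d = np.ones(X.shape[0])
    for _ in range(n_iter):
        # Geometric-mean damped update; the plain update d <- 1/(K d) can oscillate.
        d_new = np.sqrt(d / (K @ d))
        if np.max(np.abs(d_new - d)) < tol:
            d = d_new
            break
        d = d_new

    return d[:, None] * K * d[None, :]
```

By construction the output is symmetric with zero diagonal, and at convergence its rows and columns each sum to one, which is the doubly-stochastic affinity matrix the abstract analyzes.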