Little Anna, Xie Yuying, Sun Qiang
Department of Mathematics, Utah Center for Data Science, University of Utah, Salt Lake City, UT 84112, USA.
Department of Computational Mathematics, Science and Engineering, Department of Statistics, Michigan State University, East Lansing, MI 48824, USA.
Inf Inference. 2022 Apr 23;12(1):72-112. doi: 10.1093/imaiai/iaac004. eCollection 2023 Mar.
Classical multidimensional scaling is a widely used dimension reduction technique. Yet few theoretical results characterizing its statistical performance exist. This paper provides a theoretical framework for analyzing the quality of embedded samples produced by classical multidimensional scaling. This lays a foundation for various downstream statistical analyses, and we focus on clustering noisy data. Our results provide scaling conditions on the signal-to-noise ratio under which classical multidimensional scaling followed by a distance-based clustering algorithm can recover the cluster labels of all samples. Simulation studies confirm these scaling conditions are sharp. Applications to cancer gene-expression data, single-cell RNA sequencing data and natural language data lend strong support to the methodology and theory.
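The pipeline named in the abstract, classical multidimensional scaling followed by a distance-based clustering algorithm, can be illustrated with a minimal NumPy sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the function names `classical_mds` and `kmeans`, the use of k-means with a farthest-point initialization as the clustering step, the embedding dimension, and the synthetic two-cluster Gaussian data.

```python
import numpy as np


def classical_mds(D, k):
    """Embed n points in R^k from an n x n matrix of pairwise distances."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n        # centering matrix
    B = -0.5 * J @ (D ** 2) @ J                # double-centered squared distances
    evals, evecs = np.linalg.eigh(B)           # eigenvalues in ascending order
    idx = np.argsort(evals)[::-1][:k]          # indices of the k largest eigenvalues
    lam = np.clip(evals[idx], 0.0, None)       # guard against small negative values
    return evecs[:, idx] * np.sqrt(lam)        # n x k embedding


def kmeans(Y, n_clusters, n_iter=100):
    """Lloyd's algorithm with a deterministic farthest-point initialization."""
    centers = [Y[0]]
    for _ in range(1, n_clusters):
        d = np.min([np.linalg.norm(Y - c, axis=1) for c in centers], axis=0)
        centers.append(Y[np.argmax(d)])
    centers = np.array(centers)
    for _ in range(n_iter):
        dists = np.linalg.norm(Y[:, None, :] - centers[None, :, :], axis=2)
        labels = np.argmin(dists, axis=1)
        new_centers = np.array([
            Y[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(n_clusters)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels


# Synthetic data (assumed for illustration): two Gaussian clusters in 200
# dimensions whose means differ only in the first coordinate.
rng = np.random.default_rng(0)
n_per, dim = 30, 200
mu = np.zeros(dim)
mu[0] = 20.0
X = np.vstack([rng.normal(size=(n_per, dim)),
               mu + rng.normal(size=(n_per, dim))])
truth = np.repeat([0, 1], n_per)

# Only the pairwise distances are passed to the embedding step.
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
Y = classical_mds(D, k=1)
labels = kmeans(Y, n_clusters=2)

# Cluster labels are only defined up to a swap of the two label names.
accuracy = max(np.mean(labels == truth), np.mean(labels != truth))
print(f"fraction of samples correctly clustered: {accuracy:.2f}")
```

Shrinking the mean separation `mu[0]` relative to the noise level, or increasing `dim`, is one informal way to explore how recovery degrades as the signal-to-noise ratio falls. The sketch does not reproduce the paper's scaling conditions.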