Department of Mathematics, Indiana University, Bloomington, Indiana, United States of America.
Department of Statistics, Indiana University, Bloomington, Indiana, United States of America.
PLoS Comput Biol. 2024 Sep 12;20(9):e1012427. doi: 10.1371/journal.pcbi.1012427. eCollection 2024 Sep.
The goal of dimension reduction tools is to construct a low-dimensional representation of high-dimensional data. These tools are employed for a variety of reasons, such as noise reduction, visualization, and lowering computational costs. However, a fundamental issue that is discussed in other modeling problems is often overlooked in dimension reduction: overfitting. In other modeling problems, techniques such as feature selection, cross-validation, and regularization are employed to combat overfitting, but such precautions are rarely taken when applying dimension reduction. Prior applications of the two most popular non-linear dimension reduction methods, t-SNE and UMAP, fail to acknowledge data as a combination of signal and noise when assessing performance. These methods are typically calibrated to capture the entirety of the data, not just the signal. In this paper, we demonstrate the importance of acknowledging noise when calibrating hyperparameters and present a framework that enables users to do so. We use this framework to explore the role hyperparameter calibration plays in overfitting the data when applying t-SNE and UMAP. More specifically, we show that previously recommended values for perplexity and n_neighbors are too small and overfit the noise. We also provide a workflow others may use to calibrate hyperparameters in the presence of noise.
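To make the idea of scoring an embedding against the signal rather than the full data concrete, the following is a minimal sketch and not the framework presented in the paper. It assumes the signal is available separately from the noisy observations (for example, in a simulation), and it uses nearest-neighbor overlap as the quality score, which is an illustrative choice. The function names `knn_indices`, `knn_preservation`, and `sweep_perplexity` are hypothetical; `TSNE` is the scikit-learn implementation.

```python
import numpy as np


def knn_indices(points, k):
    """Return the indices of the k nearest neighbors of every row (self excluded)."""
    # Pairwise squared Euclidean distances via the expansion |a-b|^2 = |a|^2 + |b|^2 - 2ab.
    sq = np.sum(points ** 2, axis=1)
    dists = sq[:, None] + sq[None, :] - 2.0 * points @ points.T
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbor
    return np.argsort(dists, axis=1)[:, :k]


def knn_preservation(reference, embedding, k=10):
    """Mean fraction of each point's k nearest neighbors in `reference`
    that are also among its k nearest neighbors in `embedding`."""
    ref = knn_indices(reference, k)
    emb = knn_indices(embedding, k)
    overlap = [len(set(r) & set(e)) for r, e in zip(ref, emb)]
    return float(np.mean(overlap)) / k


def sweep_perplexity(signal, noisy, perplexities, k=10, seed=0):
    """Embed the noisy data with t-SNE at each perplexity and score the result
    against the signal alone and against the noisy data.

    Assumption: `signal` and `noisy` have the same rows in the same order,
    and every perplexity is smaller than the number of rows.
    """
    # Imported lazily so the scoring helpers stay usable without scikit-learn.
    from sklearn.manifold import TSNE

    results = {}
    for perp in perplexities:
        emb = TSNE(n_components=2, perplexity=perp,
                   random_state=seed).fit_transform(noisy)
        results[perp] = {
            "vs_signal": knn_preservation(signal, emb, k),
            "vs_noisy": knn_preservation(noisy, emb, k),
        }
    return results
```

Under these assumptions, calling `sweep_perplexity(signal, signal + noise, [5, 30, 100])` on simulated data returns both scores for each perplexity. A perplexity that scores well against the noisy data but poorly against the signal would be a candidate for overfitting the noise in this toy setting. The same comparison could be repeated for UMAP by varying `n_neighbors`.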