Calibrating dimension reduction hyperparameters in the presence of noise.

Affiliations

Department of Mathematics, Indiana University, Bloomington, Indiana, United States of America.

Department of Statistics, Indiana University, Bloomington, Indiana, United States of America.

Publication information

PLoS Comput Biol. 2024 Sep 12;20(9):e1012427. doi: 10.1371/journal.pcbi.1012427. eCollection 2024 Sep.

Abstract

The goal of dimension reduction tools is to construct a low-dimensional representation of high-dimensional data. These tools are employed for a variety of reasons, such as noise reduction, visualization, and lowering computational costs. However, a fundamental issue that is discussed in other modeling problems is often overlooked in dimension reduction: overfitting. In other modeling problems, techniques such as feature selection, cross-validation, and regularization are employed to combat overfitting, but such precautions are rarely taken when applying dimension reduction. Prior applications of the two most popular non-linear dimension reduction methods, t-SNE and UMAP, fail to acknowledge data as a combination of signal and noise when assessing performance. These methods are typically calibrated to capture the entirety of the data, not just the signal. In this paper, we demonstrate the importance of acknowledging noise when calibrating hyperparameters and present a framework that enables users to do so. We use this framework to explore the role hyperparameter calibration plays in overfitting the data when applying t-SNE and UMAP. More specifically, we show that previously recommended values for perplexity and n_neighbors are too small and overfit the noise. We also provide a workflow others may use to calibrate hyperparameters in the presence of noise.
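The idea of scoring an embedding against the signal rather than against the full noisy data can be illustrated with a small Python sketch. This is not the authors' framework or code; it is a minimal illustration under several assumptions: the data are synthetic so that the signal is known exactly, the quality score is the Spearman correlation between pairwise distances (chosen here only for simplicity), and the names `shepard_goodness` and `calibrate_perplexity` are hypothetical helpers. It uses scikit-learn's `TSNE` and varies only `perplexity`.

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr
from sklearn.manifold import TSNE


def shepard_goodness(reference, embedding):
    """Spearman correlation between pairwise distances in two point sets."""
    rho, _ = spearmanr(pdist(reference), pdist(embedding))
    return float(rho)


def calibrate_perplexity(signal, noisy, perplexities, seed=0):
    """Embed the noisy data at each perplexity and score the result
    against the signal and, for comparison, against the noisy data."""
    results = []
    for p in perplexities:
        emb = TSNE(n_components=2, perplexity=p, init="pca",
                   random_state=seed).fit_transform(noisy)
        results.append({
            "perplexity": p,
            "vs_signal": shepard_goodness(signal, emb),
            "vs_noisy": shepard_goodness(noisy, emb),
        })
    return results


# Assumed toy data: three clusters in 2 dimensions, padded to 10 dimensions
# (the "signal"), plus Gaussian noise in every coordinate (the "noise").
rng = np.random.default_rng(0)
n = 90
centers = np.array([[0.0, 0.0], [6.0, 0.0], [0.0, 6.0]])
low_dim = centers[np.repeat(np.arange(3), n // 3)] + rng.normal(size=(n, 2))
signal = np.hstack([low_dim, np.zeros((n, 8))])
noisy = signal + rng.normal(scale=1.0, size=signal.shape)

results = calibrate_perplexity(signal, noisy, perplexities=[5, 15, 29])
for r in results:
    print(f"perplexity={r['perplexity']:>2}  "
          f"vs_signal={r['vs_signal']:.3f}  vs_noisy={r['vs_noisy']:.3f}")

# Pick the perplexity whose embedding best matches the signal alone.
best = max(results, key=lambda r: r["vs_signal"])
print("selected perplexity:", best["perplexity"])
```

In this sketch the embedding is always computed from the noisy data, since that is all a user observes in practice; only the reference used for scoring changes. Comparing the `vs_signal` and `vs_noisy` columns across the grid shows whether the two criteria would select different hyperparameter values on a given data set.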

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e85a/11421778/fee9baddf04a/pcbi.1012427.g001.jpg
