Sharp-SSL: Selective High-Dimensional Axis-Aligned Random Projections for Semi-Supervised Learning.

Authors

Wang Tengyao, Dobriban Edgar, Gataric Milana, Samworth Richard J

Affiliations

Department of Statistics, London School of Economics, London, UK.

Department of Statistics and Data Science, University of Pennsylvania, Philadelphia, PA.

Publication information

J Am Stat Assoc. 2024 Apr 12;120(549):395-407. doi: 10.1080/01621459.2024.2340792. eCollection 2025.

Abstract

We propose a new method for high-dimensional semi-supervised learning problems based on the careful aggregation of the results of a low-dimensional procedure applied to many axis-aligned random projections of the data. Our primary goal is to identify important variables for distinguishing between the classes; existing low-dimensional methods can then be applied for final class assignment. To this end, we score projections according to their class-distinguishing ability; for instance, motivated by a generalized Rayleigh quotient, we can compute the traces of estimated whitened between-class covariance matrices on the projected data. This enables us to assign an importance weight to each variable for a given projection, and to select our signal variables by aggregating these weights over high-scoring projections. Our theory shows that the resulting Sharp-SSL algorithm is able to recover the signal coordinates with high probability when we aggregate over sufficiently many random projections and when the base procedure estimates the diagonal entries of the whitened between-class covariance matrix sufficiently well. For the Gaussian EM base procedure, we provide a new analysis of its performance in semi-supervised settings that controls the parameter estimation error in terms of the proportion of labeled data in the sample. Numerical results on both simulated data and a real colon tumor dataset support the excellent empirical performance of the method. Supplementary materials for this article are available online, including a standardized description of the materials available for reproducing the work.
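The scoring and aggregation steps described in the abstract can be illustrated with a short Python sketch. This is not the authors' implementation, and it rests on several simplifying assumptions: all observations are labeled, so the semi-supervised Gaussian EM base procedure is replaced by plain sample means and a pooled within-class covariance; there are only two classes; and the aggregation rule (keep a fixed top fraction of projections, then sum the weights) is a stand-in, since the abstract does not spell out the exact scheme. The function names and the parameters `d`, `n_proj`, `keep_frac` and `n_select` are hypothetical.

```python
import numpy as np


def score_projection(X, y, coords):
    """Score one axis-aligned projection (fully labeled, two-class simplification).

    Returns the trace of the estimated whitened between-class covariance
    matrix on the projected data, together with its diagonal entries,
    which serve as per-variable importance weights.
    """
    Xs = X[:, coords]
    X0, X1 = Xs[y == 0], Xs[y == 1]
    pi0, pi1 = len(X0) / len(Xs), len(X1) / len(Xs)
    delta = X1.mean(axis=0) - X0.mean(axis=0)

    # Pooled within-class covariance estimate.
    R0 = X0 - X0.mean(axis=0)
    R1 = X1 - X1.mean(axis=0)
    sigma = (R0.T @ R0 + R1.T @ R1) / (len(Xs) - 2)

    # Symmetric inverse square root via an eigendecomposition.
    evals, evecs = np.linalg.eigh(sigma)
    inv_sqrt = evecs @ np.diag(evals ** -0.5) @ evecs.T

    # Diagonal of pi0 * pi1 * Sigma^{-1/2} delta delta^T Sigma^{-1/2}.
    weights = pi0 * pi1 * (inv_sqrt @ delta) ** 2
    return weights.sum(), weights


def select_variables(X, y, d=3, n_proj=500, keep_frac=0.1, n_select=2, seed=0):
    """Aggregate weights over the highest-scoring random projections.

    Assumption: keeping the top `keep_frac` of projections is an
    illustrative rule, not necessarily the one used in the paper.
    """
    rng = np.random.default_rng(seed)
    p = X.shape[1]
    results = []
    for _ in range(n_proj):
        # An axis-aligned projection is just a random subset of d coordinates.
        coords = rng.choice(p, size=d, replace=False)
        score, weights = score_projection(X, y, coords)
        results.append((score, coords, weights))

    results.sort(key=lambda r: r[0], reverse=True)
    n_keep = max(1, int(keep_frac * n_proj))

    importance = np.zeros(p)
    for _, coords, weights in results[:n_keep]:
        importance[coords] += weights

    selected = np.sort(np.argsort(importance)[-n_select:])
    return selected, importance


def make_toy_data(n=200, p=30, shift=2.0, seed=0):
    """Toy data (hypothetical): only coordinates 0 and 1 separate the classes."""
    rng = np.random.default_rng(seed)
    y = rng.integers(0, 2, size=n)
    X = rng.standard_normal((n, p))
    X[y == 1, :2] += shift
    return X, y
```

Calling `select_variables(*make_toy_data())` returns the indices of the selected coordinates and the aggregated importance vector. In a semi-supervised setting, the class means, class proportions and covariance inside `score_projection` would instead come from a base procedure such as Gaussian EM fitted to both labeled and unlabeled points.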

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/b9b4/12012707/b81ee109ed1b/UASA_A_2340792_F0001_C.jpg
