Sharp-SSL: Selective High-Dimensional Axis-Aligned Random Projections for Semi-Supervised Learning.

Author information

Wang Tengyao, Dobriban Edgar, Gataric Milana, Samworth Richard J

Affiliations

Department of Statistics, London School of Economics, London, UK.

Department of Statistics and Data Science, University of Pennsylvania, Philadelphia, PA.

Publication information

J Am Stat Assoc. 2024 Apr 12;120(549):395-407. doi: 10.1080/01621459.2024.2340792. eCollection 2025.

DOI:10.1080/01621459.2024.2340792

PMID:40264988

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC12012707/

Abstract

We propose a new method for high-dimensional semi-supervised learning problems based on the careful aggregation of the results of a low-dimensional procedure applied to many axis-aligned random projections of the data. Our primary goal is to identify important variables for distinguishing between the classes; existing low-dimensional methods can then be applied for final class assignment. To this end, we score projections according to their class-distinguishing ability; for instance, motivated by a generalized Rayleigh quotient, we can compute the traces of estimated whitened between-class covariance matrices on the projected data. This enables us to assign an importance weight to each variable for a given projection, and to select our signal variables by aggregating these weights over high-scoring projections. Our theory shows that the resulting Sharp-SSL algorithm is able to recover the signal coordinates with high probability when we aggregate over sufficiently many random projections and when the base procedure estimates the diagonal entries of the whitened between-class covariance matrix sufficiently well. For the Gaussian EM base procedure, we provide a new analysis of its performance in semi-supervised settings that controls the parameter estimation error in terms of the proportion of labeled data in the sample. Numerical results on both simulated data and a real colon tumor dataset support the excellent empirical performance of the method. Supplementary materials for this article are available online, including a standardized description of the materials available for reproducing the work.
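The scoring and aggregation steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation and makes several simplifying assumptions: the base procedure here uses fully labeled data only (the paper's base procedure, such as Gaussian EM, also exploits unlabeled observations); whitening uses the symmetric inverse square root of the within-class covariance with a small ridge term; and "high-scoring projections" are chosen by keeping the best projection in each of several groups. The function names and the parameters `d`, `n_groups`, `group_size` and `n_select` are hypothetical.

```python
import numpy as np


def whitened_between_class_diag(X, y):
    """Diagonal of W^{-1/2} B W^{-1/2}, with W and B the within- and
    between-class covariance estimates computed from labeled data."""
    n, d = X.shape
    overall_mean = X.mean(axis=0)
    within = np.zeros((d, d))
    between = np.zeros((d, d))
    for c in np.unique(y):
        Xc = X[y == c]
        mu = Xc.mean(axis=0)
        centered = Xc - mu
        within += centered.T @ centered
        between += len(Xc) * np.outer(mu - overall_mean, mu - overall_mean)
    within = within / n + 1e-6 * np.eye(d)  # small ridge (assumption) for stability
    between = between / n
    evals, evecs = np.linalg.eigh(within)
    inv_sqrt = evecs @ np.diag(evals ** -0.5) @ evecs.T
    return np.diag(inv_sqrt @ between @ inv_sqrt)


def select_variables(X, y, d=3, n_groups=50, group_size=20, n_select=3, seed=0):
    """Aggregate per-variable weights over the best-scoring axis-aligned
    projection in each group, then keep the top-weighted variables."""
    rng = np.random.default_rng(seed)
    p = X.shape[1]
    weights = np.zeros(p)
    for _ in range(n_groups):
        best_score, best_subset, best_diag = -np.inf, None, None
        for _ in range(group_size):
            # An axis-aligned projection is a random subset of d coordinates.
            subset = rng.choice(p, size=d, replace=False)
            diag = whitened_between_class_diag(X[:, subset], y)
            score = diag.sum()  # trace of the whitened between-class matrix
            if score > best_score:
                best_score, best_subset, best_diag = score, subset, diag
        weights[best_subset] += best_diag
    weights /= n_groups
    selected = np.sort(np.argsort(weights)[-n_select:])
    return selected, weights


# Toy data (assumption): 50 variables, class signal only in coordinates 0, 1, 2.
data_rng = np.random.default_rng(1)
n, p = 200, 50
y = data_rng.integers(0, 2, size=n)
X = data_rng.standard_normal((n, p))
X[:, :3] += 1.5 * (2 * y - 1)[:, None]

selected, weights = select_variables(X, y)
print(selected)
```

In this toy setting the aggregated weights concentrate on the coordinates that carry the class signal, because projections containing those coordinates obtain the largest traces and contribute their diagonal entries to the running totals.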

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/b9b4/12012707/b81ee109ed1b/UASA_A_2340792_F0001_C.jpg

Similar articles

Semi-supervised learning framework with shape encoding for neonatal ventricular segmentation from 3D ultrasound.

Med Phys. 2024 Sep;51(9):6134-6148. doi: 10.1002/mp.17242. Epub 2024 Jun 10.

Comparing supervised and semi-supervised Machine Learning Models on Diagnosing Breast Cancer.

Ann Med Surg (Lond). 2021 Jan 8;62:53-64. doi: 10.1016/j.amsu.2020.12.043. eCollection 2021 Feb.

Semi-supervised Long-tail Endoscopic Image Classification.

Chin Med Sci J. 2022 Sep 30;37(3):171-180. doi: 10.24920/004135.

Semi-supervised abdominal multi-organ segmentation by object-redrawing.

Med Phys. 2024 Nov;51(11):8334-8347. doi: 10.1002/mp.17364. Epub 2024 Aug 21.

Efficient Evaluation of Prediction Rules in Semi-Supervised Settings under Stratified Sampling.

J R Stat Soc Series B Stat Methodol. 2022 Sep;84(4):1353-1391. doi: 10.1111/rssb.12502. Epub 2022 Apr 26.

A semi-supervised algorithm for improving the consistency of crowdsourced datasets: The COVID-19 case study on respiratory disorder classification.

Comput Methods Programs Biomed. 2023 Nov;241:107743. doi: 10.1016/j.cmpb.2023.107743. Epub 2023 Aug 9.

Semi-supervised oblique predictive clustering trees.

PeerJ Comput Sci. 2021 May 3;7:e506. doi: 10.7717/peerj-cs.506. eCollection 2021.

SemiContour: A Semi-supervised Learning Approach for Contour Detection.

Proc IEEE Comput Soc Conf Comput Vis Pattern Recognit. 2016 Jun;2016:251-259. doi: 10.1109/CVPR.2016.34. Epub 2016 Dec 12.

Towards a Theoretical Understanding of Semi-Supervised Learning Under Class Distribution Mismatch.

IEEE Trans Pattern Anal Mach Intell. 2025 Jun;47(6):4853-4868. doi: 10.1109/TPAMI.2025.3545930. Epub 2025 May 7.

References cited in this article

How to reduce dimension with PCA and random projections?

IEEE Trans Inf Theory. 2021 Dec;67(12):8154-8189. doi: 10.1109/tit.2021.3112821. Epub 2021 Sep 14.

Statistical properties of sketching algorithms.

Biometrika. 2021 Jun;108(2):283-297. doi: 10.1093/biomet/asaa062. Epub 2020 Jul 30.

Clusterability and Clustering of Images and Other "Real" High-Dimensional Data.

IEEE Trans Image Process. 2018 Apr;27(4):1927-1938. doi: 10.1109/TIP.2017.2789327.

Not-so-supervised: A survey of semi-supervised, multi-instance, and transfer learning in medical image analysis.

Med Image Anal. 2019 May;54:280-296. doi: 10.1016/j.media.2019.03.009. Epub 2019 Mar 29.

Clustering algorithms: A comparative approach.

PLoS One. 2019 Jan 15;14(1):e0210236. doi: 10.1371/journal.pone.0210236. eCollection 2019.

Integrating single-cell transcriptomic data across different conditions, technologies, and species.

Nat Biotechnol. 2018 Jun;36(5):411-420. doi: 10.1038/nbt.4096. Epub 2018 Apr 2.

Penalized classification using Fisher's linear discriminant.

J R Stat Soc Series B Stat Methodol. 2011 Nov;73(5):753-772. doi: 10.1111/j.1467-9868.2011.00783.x.

A framework for feature selection in clustering.

J Am Stat Assoc. 2010 Jun 1;105(490):713-726. doi: 10.1198/jasa.2010.tm09415.

Clustering cancer gene expression data: a comparative study.

BMC Bioinformatics. 2008 Nov 27;9:497. doi: 10.1186/1471-2105-9-497.

Survey of clustering algorithms.

IEEE Trans Neural Netw. 2005 May;16(3):645-78. doi: 10.1109/TNN.2005.845141.
