Supervised dimensionality reduction for big data.

Authors

Vogelstein Joshua T, Bridgeford Eric W, Tang Minh, Zheng Da, Douville Christopher, Burns Randal, Maggioni Mauro

Affiliation

Johns Hopkins University, Baltimore, MD, USA.

Publication

Nat Commun. 2021 May 17;12(1):2872. doi: 10.1038/s41467-021-23102-2.

DOI:10.1038/s41467-021-23102-2

PMID:34001899

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC8129083/

Abstract

To solve key biomedical problems, experimentalists now routinely measure millions or billions of features (dimensions) per sample, with the hope that data science techniques will be able to build accurate data-driven inferences. Because sample sizes are typically orders of magnitude smaller than the dimensionality of these data, valid inferences require finding a low-dimensional representation that preserves the discriminating information (e.g., whether the individual suffers from a particular disease). There is a lack of interpretable supervised dimensionality reduction methods that scale to millions of dimensions and come with strong statistical guarantees. We introduce an approach that extends principal components analysis by incorporating class-conditional moment estimates into the low-dimensional projection. The simplest version, Linear Optimal Low-Rank Projection, incorporates the class-conditional means. We prove, and substantiate with both synthetic and real data benchmarks, that Linear Optimal Low-Rank Projection and its generalizations lead to improved data representations for subsequent classification, while maintaining computational efficiency and scalability. Using multiple brain imaging datasets consisting of more than 150 million features, and several genomics datasets with more than 500,000 features, Linear Optimal Low-Rank Projection outperforms other scalable linear dimensionality reduction techniques in terms of accuracy, while requiring only a few minutes on a standard desktop computer.
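
The idea of adding class-conditional means to a principal-components projection can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function name `lol_projection`, the choice of the first class as the reference for mean differences, and the final QR orthonormalization are assumptions made for illustration, and the dense SVD used here would not scale to the feature counts quoted in the abstract.

```python
import numpy as np

def lol_projection(X, y, d):
    """Sketch of a means-plus-PCA projection (hypothetical helper).

    X: (n, p) data matrix, y: (n,) class labels, d: target dimension.
    Returns a (p, d) matrix with orthonormal columns.
    """
    classes, idx = np.unique(y, return_inverse=True)
    K = len(classes)
    if not (K - 1 <= d <= min(X.shape)):
        raise ValueError("need K - 1 <= d <= min(n, p)")

    # Class-conditional means, one row per class: (K, p)
    means = np.stack([X[idx == k].mean(axis=0) for k in range(K)])

    # Differences of class means relative to the first class: (K - 1, p)
    # (assumed reference class; other orderings are possible)
    deltas = means[1:] - means[0]

    # Center every sample by its own class mean, then take the
    # leading right singular vectors of the centered data.
    Xc = X - means[idx]
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)

    # Stack mean differences with the top d - (K - 1) directions
    # and orthonormalize the result.
    A = np.vstack([deltas, Vt[: d - (K - 1)]]).T  # (p, d)
    Q, _ = np.linalg.qr(A)
    return Q

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(40, 50))   # n = 40 samples, p = 50 features
    y = np.repeat([0, 1], 20)
    X[y == 1, 0] += 3.0             # classes differ along feature 0
    Q = lol_projection(X, y, d=3)
    Z = X @ Q                       # low-dimensional representation
    print(Z.shape)
```

The projected data `Z` would then be passed to a classifier of choice; the first column of `Q` is aligned with the difference of class means, which is the supervised component that plain principal components analysis lacks.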
