PUlasso: High-Dimensional Variable Selection With Presence-Only Data

Authors

Song Hyebin, Raskutti Garvesh

Affiliation

Department of Statistics, University of Wisconsin-Madison, Madison, WI.

Publication information

J Am Stat Assoc. 2019;115(529):334-347. doi: 10.1080/01621459.2018.1546587. Epub 2019 Apr 11.

DOI:10.1080/01621459.2018.1546587

PMID:32255883

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC7133715/

Abstract

In various real-world problems, we are presented with classification problems with positive and unlabeled data, referred to as presence-only responses. In this article, we study variable selection in the context of presence-only responses where the number of features or covariates p is large. The combination of presence-only responses and high dimensionality presents both statistical and computational challenges. We develop the PUlasso algorithm for variable selection and classification with positive and unlabeled responses. Our algorithm uses the majorization-minimization framework, which is a generalization of the well-known expectation-maximization (EM) algorithm. In particular, to make our algorithm scalable, we provide two computational speed-ups to the standard EM algorithm. We provide a theoretical guarantee where we first show that our algorithm converges to a stationary point, and then prove that any stationary point within a local neighborhood of the true parameter achieves the minimax optimal mean-squared error under both strict sparsity and group sparsity assumptions. We also demonstrate through simulations that our algorithm outperforms state-of-the-art algorithms in the moderate p settings in terms of classification performance. Finally, we demonstrate that our PUlasso algorithm performs well on a biochemistry example. Supplementary materials for this article are available online.
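The majorization-minimization idea mentioned in the abstract can be illustrated with a small sketch. This is not the PUlasso algorithm: it handles ordinary, fully labeled logistic regression with an l1 penalty, and it omits the presence-only likelihood and the two speed-ups the abstract refers to. It only shows the general pattern of replacing the objective at each step by a surrogate that lies above it and touches it at the current iterate, then minimizing the surrogate. The function names, the simulated data, and the penalty level `lam=0.05` are assumptions made for illustration.

```python
import numpy as np

def soft_threshold(z, t):
    """Proximal operator of t * ||.||_1 (elementwise soft-thresholding)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def objective(X, y, beta, lam):
    """Mean logistic loss plus lam * ||beta||_1, with labels y in {0, 1}."""
    eta = X @ beta
    # log(1 + exp(eta)) computed stably via logaddexp
    return np.mean(np.logaddexp(0.0, eta) - y * eta) + lam * np.sum(np.abs(beta))

def mm_lasso_logistic(X, y, lam, n_iter=200):
    """Illustrative MM loop (hypothetical helper, not the paper's method)."""
    n, p = X.shape
    # The logistic loss has curvature at most 1/4, so L bounds the Hessian
    # of the mean loss and the quadratic surrogate majorizes it.
    L = np.linalg.norm(X, 2) ** 2 / (4.0 * n)
    beta = np.zeros(p)
    history = [objective(X, y, beta, lam)]
    for _ in range(n_iter):
        prob = 0.5 * (1.0 + np.tanh(0.5 * (X @ beta)))  # stable sigmoid
        grad = X.T @ (prob - y) / n
        # The surrogate (quadratic bound + l1 penalty) is minimized exactly
        # by a soft-thresholded gradient step.
        beta = soft_threshold(beta - grad / L, lam / L)
        history.append(objective(X, y, beta, lam))
    return beta, np.array(history)

# Simulated data (assumed for the example): 3 active features out of 50.
rng = np.random.default_rng(0)
n, p = 200, 50
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[:3] = [2.0, -2.0, 1.5]
y = (rng.random(n) < 0.5 * (1.0 + np.tanh(0.5 * (X @ beta_true)))).astype(float)

beta_hat, history = mm_lasso_logistic(X, y, lam=0.05)
print("nonzero coefficients:", np.flatnonzero(beta_hat))
print("objective: %.4f -> %.4f" % (history[0], history[-1]))
```

Because each surrogate lies above the objective and agrees with it at the current iterate, the recorded objective values cannot increase from one iteration to the next. That descent property is the sense in which EM is a special case of majorization-minimization.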

Similar articles

A Semismooth Newton Algorithm for High-Dimensional Nonconvex Sparse Learning.

IEEE Trans Neural Netw Learn Syst. 2020 Aug;31(8):2993-3006. doi: 10.1109/TNNLS.2019.2935001. Epub 2019 Sep 12.

Effective noise-suppressed and artifact-reduced reconstruction of SPECT data using a preconditioned alternating projection algorithm.

Med Phys. 2015 Aug;42(8):4872-87. doi: 10.1118/1.4926846.

High Dimensional EM Algorithm: Statistical Optimization and Asymptotic Normality.

Adv Neural Inf Process Syst. 2015;28:2512-2520.

Nested Conjugate Gradient Algorithm With Nested Preconditioning for Non-Linear Image Restoration.

IEEE Trans Image Process. 2017 Sep;26(9):4471-4482. doi: 10.1109/TIP.2017.2717182. Epub 2017 Jun 19.

An imputation-regularized optimization algorithm for high dimensional missing data problems and beyond.

J R Stat Soc Series B Stat Methodol. 2018 Nov;80(5):899-926. doi: 10.1111/rssb.12279. Epub 2018 Jun 25.

Majorization Minimization by Coordinate Descent for Concave Penalized Generalized Linear Models.

Stat Comput. 2014 Sep;24(5):871-883. doi: 10.1007/s11222-013-9407-3.

Efficient Training for Positive Unlabeled Learning.

IEEE Trans Pattern Anal Mach Intell. 2019 Nov;41(11):2584-2598. doi: 10.1109/TPAMI.2018.2860995. Epub 2018 Jul 30.

Latent variable selection in multidimensional item response theory models using the expectation model selection algorithm.

Br J Math Stat Psychol. 2022 May;75(2):363-394. doi: 10.1111/bmsp.12261. Epub 2021 Dec 17.

An Online Minimax Optimal Algorithm for Adversarial Multiarmed Bandit Problem.

IEEE Trans Neural Netw Learn Syst. 2018 Nov;29(11):5565-5580. doi: 10.1109/TNNLS.2018.2806006. Epub 2018 Mar 8.

Cited by

Hierarchical Multi-Label Classification With Gene-Environment Interactions in Disease Modeling.

Stat Med. 2025 Feb 10;44(3-4):e10330. doi: 10.1002/sim.10330.

Probabilistic HIV recency classification-a logistic regression without labeled individual level training data.

Ann Appl Stat. 2023 Mar;17(1):108-129. doi: 10.1214/22-aoas1618. Epub 2023 Jan 24.

PLUS: Predicting cancer metastasis potential based on positive and unlabeled learning.

PLoS Comput Biol. 2022 Mar 29;18(3):e1009956. doi: 10.1371/journal.pcbi.1009956. eCollection 2022 Mar.

Microfluidic deep mutational scanning of the human executioner caspases reveals differences in structure and regulation.

Cell Death Discov. 2022 Jan 10;8(1):7. doi: 10.1038/s41420-021-00799-0.

A semi-supervised model to predict regulatory effects of genetic variants at single nucleotide resolution using massively parallel reporter assays.

Bioinformatics. 2021 Aug 4;37(14):1953-1962. doi: 10.1093/bioinformatics/btab040. Epub 2021 Jan 30.

Inferring Protein Sequence-Function Relationships with Large-Scale Positive-Unlabeled Learning.

Cell Syst. 2021 Jan 20;12(1):92-101.e8. doi: 10.1016/j.cels.2020.10.007. Epub 2020 Nov 18.

Bayesian Neural Networks for Selection of Drug Sensitive Genes.

J Am Stat Assoc. 2018;113(523):955-972. doi: 10.1080/01621459.2017.1409122. Epub 2018 Jun 28.
