Song Hyebin, Raskutti Garvesh
Department of Statistics, University of Wisconsin-Madison, Madison, WI.
J Am Stat Assoc. 2019;115(529):334-347. doi: 10.1080/01621459.2018.1546587. Epub 2019 Apr 11.
In various real-world problems, we are presented with classification problems with positive and unlabeled data, referred to as presence-only responses. In this article, we study variable selection in the context of presence-only responses where the number of features or covariates is large. The combination of presence-only responses and high dimensionality presents both statistical and computational challenges. We develop the PUlasso algorithm for variable selection and classification with positive and unlabeled responses. Our algorithm uses the majorization-minimization framework, which is a generalization of the well-known expectation-maximization (EM) algorithm. In particular, to make our algorithm scalable, we provide two computational speed-ups to the standard EM algorithm. We provide a theoretical guarantee: we first show that our algorithm converges to a stationary point, and then prove that any stationary point within a local neighborhood of the true parameter achieves the minimax optimal mean-squared error under both strict sparsity and group sparsity assumptions. We also demonstrate through simulations that our algorithm outperforms state-of-the-art algorithms in terms of classification performance in settings with a moderate number of features. Finally, we demonstrate that our PUlasso algorithm performs well on a biochemistry example. Supplementary materials for this article are available online.