Chen Zhaomeng, He Zihuai, Chu Benjamin B, Gu Jiaqi, Morrison Tim, Sabatti Chiara, Candès Emmanuel
Department of Statistics, Stanford University.
Department of Neurology and Neurological Sciences, Stanford University.
ArXiv. 2024 Feb 20:arXiv:2402.12724v1.
Identifying which variables influence a response while controlling false positives is a pervasive problem in statistics and data science. In this paper, we consider a scenario in which we only have access to summary statistics, such as the marginal empirical correlations between each variable of potential interest and the response. This situation may arise due to privacy concerns, e.g., to avoid the release of sensitive genetic information. We extend GhostKnockoffs (He et al., 2022) and introduce variable selection methods based on penalized regression that achieve false discovery rate (FDR) control. We report empirical results from extensive simulation studies, demonstrating enhanced performance over previous work. We also apply our methods to genome-wide association studies of Alzheimer's disease and find a significant improvement in power.
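To make the notion of FDR-controlled selection in the abstract concrete, the following is a minimal sketch of the generic knockoff+ selection step used by knockoff-type procedures. It is not the paper's GhostKnockoffs construction: it assumes that a vector of feature importance statistics `W` has already been computed (large positive values suggest a true signal, and signs are symmetric under the null), and the function names `knockoff_plus_threshold` and `knockoff_select` are hypothetical, chosen only for illustration.

```python
import numpy as np

def knockoff_plus_threshold(W, q):
    """Return the knockoff+ threshold for target FDR level q.

    The threshold is the smallest t among the nonzero |W_j| such that
    (1 + #{j: W_j <= -t}) / max(1, #{j: W_j >= t}) <= q.
    Returns np.inf when no such t exists (nothing is selected).
    """
    W = np.asarray(W, dtype=float)
    candidates = np.sort(np.abs(W[W != 0]))
    for t in candidates:
        # Estimated false discovery proportion at threshold t.
        fdp_hat = (1 + np.sum(W <= -t)) / max(1, np.sum(W >= t))
        if fdp_hat <= q:
            return t
    return np.inf

def knockoff_select(W, q):
    """Return indices of variables whose statistic meets the threshold."""
    W = np.asarray(W, dtype=float)
    tau = knockoff_plus_threshold(W, q)
    return np.flatnonzero(W >= tau)

# Illustrative statistics only (made-up numbers, not results from the paper).
W_example = [5, 4, 3, 2.5, -1, 2, 1.5, -0.5, 0.2, 6]
selected = knockoff_select(W_example, q=0.25)
```

In this sketch the negative statistics act as a conservative count of false positives among the positive ones, which is why only the signs and magnitudes of `W` are needed at the selection stage.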