Sudarshan Mukund, Tansey Wesley, Ranganath Rajesh
Courant Institute of Mathematical Sciences, New York University.
Department of Epidemiology and Biostatistics, Memorial Sloan Kettering Cancer Center.
Adv Neural Inf Process Syst. 2020 Dec;33:5036-5046.
Predictive modeling often uses black-box machine learning methods, such as deep neural networks, to achieve state-of-the-art performance. In scientific domains, the scientist often wishes to discover which features are actually important for making the predictions. These discoveries may lead to costly follow-up experiments, so it is important that the error rate on discoveries is not too high. Model-X knockoffs [2] enable important features to be discovered with control of the false discovery rate (FDR). However, knockoffs require rich generative models capable of accurately modeling the knockoff features while ensuring they obey the so-called "swap" property. We develop Deep Direct Likelihood Knockoffs (DDLK), which directly minimizes the KL divergence implied by the knockoff swap property. DDLK consists of two stages: it first maximizes the explicit likelihood of the features, then minimizes the KL divergence between the joint distribution of features and knockoffs and any swap between them. To ensure that the generated knockoffs are valid under any possible swap, DDLK uses the Gumbel-Softmax trick to optimize the knockoff generator under the worst-case swap. We find that DDLK has higher power than baselines while controlling the false discovery rate on a variety of synthetic and real benchmarks, including a task involving a large dataset from one of the epicenters of COVID-19.
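To make the swap step in the abstract concrete, the following is a minimal sketch of how a relaxed (Gumbel-Softmax) swap between features and knockoffs could be sampled and applied. It is an illustration under stated assumptions, not the authors' implementation: the function names `sample_relaxed_swap` and `apply_swap`, the per-feature `logits` parameterization, and the use of NumPy are all hypothetical choices made here. In an actual training loop the logits would presumably be optimized adversarially to find the worst-case swap, and that optimization is not shown.

```python
import numpy as np


def sample_relaxed_swap(logits, temperature, rng):
    """Draw one relaxed binary swap indicator per feature, each in (0, 1).

    Assumption: each feature j has a logit for "swap" versus a fixed logit
    of 0 for "keep"; this is a two-class Gumbel-Softmax sample.
    """
    gumbel_noise = rng.gumbel(size=(2,) + logits.shape)
    scores = np.stack([logits, np.zeros_like(logits)]) + gumbel_noise
    scores = scores / temperature
    scores = scores - scores.max(axis=0, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    return weights[0] / weights.sum(axis=0)  # relaxed probability of "swap"


def apply_swap(x, x_tilde, b):
    """Softly swap columns of x and x_tilde according to indicators b.

    b near 1 exchanges a feature with its knockoff; b near 0 leaves it alone.
    """
    u = b * x_tilde + (1.0 - b) * x
    v = b * x + (1.0 - b) * x_tilde
    return u, v


rng = np.random.default_rng(0)
x = np.arange(6.0).reshape(2, 3)      # toy features (2 samples, 3 features)
x_tilde = -x                          # toy stand-in for generated knockoffs
logits = np.array([50.0, -50.0, 0.0])  # swap feature 0, keep feature 1
b = sample_relaxed_swap(logits, temperature=0.1, rng=rng)
u, v = apply_swap(x, x_tilde, b)
```

Lower temperatures push the indicators toward hard 0/1 swaps, while the relaxation keeps the swap differentiable with respect to the logits, which is what allows a gradient-based search over swaps.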