Li Jingyi Jessica, Chen Yiling Elaine, Tong Xin
Department of Statistics, University of California, Los Angeles.
Department of Data Sciences and Operations, Marshall Business School, University of Southern California.
J Mach Learn Res. 2021 May;22.
Despite the availability of numerous statistical and machine learning tools for joint feature modeling, many scientists investigate features marginally, i.e., one feature at a time. This is partly due to training and convention, but it is also rooted in scientists' strong interest in simple visualization and interpretability. As such, marginal feature ranking for some predictive tasks, e.g., prediction of cancer driver genes, is widely practiced in the process of scientific discovery. In this work, we focus on marginal ranking for binary classification, one of the most common predictive tasks. We argue that the most widely used marginal ranking criteria, including the Pearson correlation, the two-sample t test, and the two-sample Wilcoxon rank-sum test, do not fully take feature distributions and prediction objectives into account. To address this gap in practice, we propose two ranking criteria corresponding to two prediction objectives: the classical criterion (CC) and the Neyman-Pearson criterion (NPC), both of which use model-free nonparametric implementations to accommodate diverse feature distributions. Theoretically, we show that under regularity conditions, both criteria achieve sample-level rankings that are consistent with their population-level counterparts with high probability. Moreover, NPC is robust to sampling bias when the two class proportions in a sample deviate from those in the population. This property gives NPC good potential in biomedical research, where sampling biases are ubiquitous. We demonstrate the use and relative advantages of CC and NPC in simulation and real data studies. Our model-free, objective-based ranking idea is extendable to ranking feature subsets and generalizable to other prediction tasks and learning objectives.
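To make the two objectives concrete, the following is a minimal sketch of objective-based marginal ranking. It is not the authors' implementation. It assumes Gaussian kernel density estimates, a single half-and-half sample split, and, for the Neyman-Pearson score, a plain empirical-quantile threshold in place of any high-probability control of the type I error. The function names `cc_score`, `npc_score`, and `rank_features` are hypothetical and chosen only for illustration.

```python
import numpy as np
from scipy.stats import gaussian_kde


def cc_score(x, y, rng):
    """Classical-objective score for one feature (illustrative assumption):
    held-out misclassification error of a kernel-density plug-in classifier."""
    idx = rng.permutation(len(y))
    half = len(y) // 2
    train, test = idx[:half], idx[half:]
    x_tr, y_tr = x[train], y[train]
    kde0 = gaussian_kde(x_tr[y_tr == 0])  # class 0 density estimate
    kde1 = gaussian_kde(x_tr[y_tr == 1])  # class 1 density estimate
    pi1 = y_tr.mean()                     # estimated class 1 proportion
    pred = (pi1 * kde1(x[test]) > (1 - pi1) * kde0(x[test])).astype(int)
    return float(np.mean(pred != y[test]))


def npc_score(x, y, alpha, rng):
    """Neyman-Pearson-objective score for one feature (illustrative assumption):
    held-out type II error when the density-ratio threshold is set at the
    empirical (1 - alpha) quantile of held-out class 0 scores."""
    x0 = rng.permutation(x[y == 0])
    x1 = rng.permutation(x[y == 1])
    h0, h1 = len(x0) // 2, len(x1) // 2
    kde0 = gaussian_kde(x0[:h0])
    kde1 = gaussian_kde(x1[:h1])

    def ratio(v):
        # Estimated density ratio; the small constant avoids division by zero.
        return kde1(v) / (kde0(v) + 1e-12)

    threshold = np.quantile(ratio(x0[h0:]), 1 - alpha)
    # Type II error: held-out class 1 points that fall at or below the threshold.
    return float(np.mean(ratio(x1[h1:]) <= threshold))


def rank_features(X, y, score, **kwargs):
    """Order column indices from best (lowest score) to worst."""
    scores = np.array([score(X[:, j], y, **kwargs) for j in range(X.shape[1])])
    return np.argsort(scores), scores


# Toy data: column 0 is pure noise, column 1 shifts by 2 between the classes.
rng = np.random.default_rng(0)
y = np.repeat([0, 1], 300)
X = np.column_stack([rng.normal(size=y.size), rng.normal(loc=2.0 * y)])

cc_order, cc_scores = rank_features(X, y, cc_score, rng=rng)
npc_order, npc_scores = rank_features(X, y, npc_score, alpha=0.05, rng=rng)
```

In this sketch a lower score means a better feature under the corresponding objective. The Neyman-Pearson score is computed from within-class samples only, so it does not use the sample class proportions, whereas the classical score uses them through `pi1`.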