Sechidis Konstantinos, Brown Gavin
School of Computer Science, University of Manchester, Manchester, M13 9PL UK.
Mach Learn. 2018;107(2):357-395. doi: 10.1007/s10994-017-5648-2. Epub 2017 Jul 17.
What is the simplest thing you can do to solve a problem? In the context of semi-supervised feature selection, we tackle exactly this question: how much can we gain from two simple strategies? If we have some binary labelled data and some unlabelled data, we could assume the unlabelled data are all positives, or assume they are all negatives. These minimalist, seemingly naive approaches have not previously been studied in depth. However, through theoretical and empirical studies, we show that they provide powerful results for feature selection, via hypothesis testing and feature ranking. Combining them with some "soft" prior knowledge of the domain, we derive two novel algorithms (-JMI, -IAMB) that outperform significantly more complex competing methods, showing particularly good performance when the labels are missing-not-at-random. We conclude that simple approaches to this problem can work surprisingly well, and that in many situations we can provably recover the exact feature selection dynamics.
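The surrogate-labelling idea can be illustrated with a minimal sketch: fill the missing labels with all positives (or all negatives), then rank features by empirical mutual information with the surrogate labels. This is only a toy illustration of the naive strategy described above, assuming discrete features, not an implementation of the authors' -JMI or -IAMB algorithms; the data and feature names here are hypothetical.

```python
import math
from collections import Counter

def mutual_information(x, y):
    """Empirical mutual information (nats) between two discrete sequences."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    mi = 0.0
    for (a, b), c in pxy.items():
        p_ab = c / n
        mi += p_ab * math.log(p_ab / ((px[a] / n) * (py[b] / n)))
    return mi

def rank_features(features, labels):
    """Rank feature indices by MI with the (surrogate) labels, descending."""
    scores = [mutual_information(col, labels) for col in features]
    return sorted(range(len(features)), key=lambda i: -scores[i])

# Hypothetical toy data: 3 discrete features over 8 samples; the last 3
# samples are unlabelled (None).
features = [
    [0, 0, 1, 1, 0, 1, 1, 0],   # informative feature
    [1, 0, 1, 0, 1, 0, 1, 0],   # noise
    [0, 1, 0, 1, 1, 0, 0, 1],   # noise
]
partial = [0, 0, 1, 1, 0, None, None, None]

# Strategy 1: assume all unlabelled examples are positives.
as_pos = [1 if y is None else y for y in partial]
# Strategy 2: assume all unlabelled examples are negatives.
as_neg = [0 if y is None else y for y in partial]

print(rank_features(features, as_pos))
print(rank_features(features, as_neg))
```

Comparing the two rankings gives a cheap sensitivity check: features that rank highly under both surrogate labellings are robust candidates regardless of which assumption is closer to the truth.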