Silva Jorge, Willett Rebecca
Duke University, Durham, NC 27708, USA.
IEEE Trans Pattern Anal Mach Intell. 2009 Mar;31(3):563-9. doi: 10.1109/TPAMI.2008.232.
This paper addresses the problem of detecting anomalous multivariate co-occurrences using a limited number of unlabeled training observations. A novel method based on using a hypergraph representation of the data is proposed to deal with this very high-dimensional problem. Hypergraphs constitute an important extension of graphs which allow edges to connect more than two vertices simultaneously. A variational Expectation-Maximization algorithm for detecting anomalies directly on the hypergraph domain without any feature selection or dimensionality reduction is presented. The resulting estimate can be used to calculate a measure of anomalousness based on the False Discovery Rate. The algorithm has O(np) computational complexity, where n is the number of training observations and p is the number of potential participants in each co-occurrence event. This efficiency makes the method ideally suited for very high-dimensional settings, and requires no tuning, bandwidth or regularization parameters. The proposed approach is validated on both high-dimensional synthetic data and the Enron email database, where p > 75,000, and it is shown that it can outperform other state-of-the-art methods.
本文探讨了使用有限数量的未标记训练观测值来检测异常多元共现的问题。提出了一种基于使用数据的超图表示的新颖方法来处理这个高维问题。超图是图的重要扩展,它允许边同时连接两个以上的顶点。提出了一种变分期望最大化算法,用于在超图域上直接检测异常,无需任何特征选择或降维。所得估计可用于基于错误发现率计算异常度量。该算法具有O(np)的计算复杂度,其中n是训练观测值的数量,p是每个共现事件中潜在参与者的数量。这种效率使该方法非常适合高维设置,并且不需要调整、带宽或正则化参数。所提出的方法在高维合成数据和安然电子邮件数据库(其中p>75,000)上都得到了验证,结果表明它可以优于其他现有方法。