Chen Yixin, Dang Xin, Peng Hanxiang, Bart Henry L
Department of Computer and Information Science, University of Mississippi, University, MS 38677, USA.
IEEE Trans Pattern Anal Mach Intell. 2009 Feb;31(2):288-305. doi: 10.1109/TPAMI.2008.72.
Statistical depth functions provide a "center-outward ordering" of multidimensional data, starting from the "deepest" point. In this sense, depth functions can measure the "extremeness" or "outlyingness" of a data point with respect to a given data set. Hence, they can detect outliers, that is, observations that appear extreme relative to the rest of the observations. Of the various statistical depths, the spatial depth is especially appealing because of its computational efficiency and mathematical tractability. In this article, we propose a novel statistical depth, the kernelized spatial depth (KSD), which generalizes the spatial depth via positive definite kernels. With a properly chosen kernel, the KSD can capture the local structure of a data set where the spatial depth fails to do so. We demonstrate this on half-moon data and ring-shaped data. Based on the KSD, we propose a novel outlier detection algorithm in which an observation with a depth value less than a threshold is declared an outlier. The proposed algorithm is simple in structure: for a given kernel, the threshold is the only parameter. It applies to a one-class learning setting, in which "normal" observations are given as the training data, as well as to a missing-label scenario, where the training set consists of a mixture of normal observations and outliers with unknown labels. We give upper bounds on the false alarm probability of a depth-based detector. These upper bounds can be used to determine the threshold. We perform extensive experiments on synthetic data and on data sets from real applications, and compare the proposed outlier detector with existing methods. The KSD outlier detector demonstrates competitive performance.
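The depth-and-threshold idea described in the abstract can be illustrated with a minimal sketch. The abstract does not state the KSD formula, so the code below rests on assumptions: it starts from the usual sample spatial depth, one minus the norm of the average unit vector pointing from the query point to the data points, and evaluates that norm in feature space using only kernel values. The Gaussian kernel, the bandwidth `sigma`, the normalization by the number of training points, and the function names `gaussian_kernel`, `kernelized_spatial_depth`, and `is_outlier` are all illustrative choices and may differ from the paper's exact definitions.

```python
import numpy as np


def gaussian_kernel(a, b, sigma=1.0):
    """Pairwise Gaussian kernel between the rows of a and the rows of b."""
    sq = (np.sum(a ** 2, axis=1)[:, None]
          + np.sum(b ** 2, axis=1)[None, :]
          - 2.0 * a @ b.T)
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma ** 2))


def kernelized_spatial_depth(x, data, sigma=1.0):
    """Depth of x relative to data, computed from kernel values only.

    Assumed form: 1 - || (1/n) * sum_y u_y ||, where u_y is the unit
    vector from phi(y) to phi(x) in the kernel-induced feature space.
    """
    x = np.atleast_2d(np.asarray(x, dtype=float))
    data = np.asarray(data, dtype=float)
    n = len(data)

    k_xx = gaussian_kernel(x, x, sigma)[0, 0]
    k_xd = gaussian_kernel(x, data, sigma)[0]      # shape (n,)
    k_dd = gaussian_kernel(data, data, sigma)      # shape (n, n)

    # Feature-space distance between x and each data point.
    dist = np.sqrt(np.maximum(k_xx + np.diag(k_dd) - 2.0 * k_xd, 0.0))
    keep = dist > 1e-12  # points coinciding with x contribute zero

    # Inner products <phi(x) - phi(y), phi(x) - phi(z)> for all pairs.
    inner = k_xx + k_dd - k_xd[:, None] - k_xd[None, :]
    cosines = inner[np.ix_(keep, keep)] / np.outer(dist[keep], dist[keep])

    norm_of_mean = np.sqrt(max(cosines.sum(), 0.0)) / n
    return 1.0 - norm_of_mean


def is_outlier(x, data, threshold, sigma=1.0):
    """Declare x an outlier when its depth falls below the threshold."""
    return kernelized_spatial_depth(x, data, sigma) < threshold


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    train = rng.normal(size=(200, 2))  # synthetic "normal" observations
    for point in ([0.0, 0.0], [10.0, 10.0]):
        print(point, kernelized_spatial_depth(point, train))
```

In this sketch a point near the bulk of the training data receives a larger depth than a point far away from it, so a single threshold on the depth separates the two. How that threshold should be set (for example, from the false alarm bounds mentioned in the abstract) is not covered here.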