Gregg John T, Moore Jason H
Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania, Philadelphia, PA, USA.
Department of Computational Biomedicine, Cedars-Sinai Medical Center, Los Angeles, CA, 90069, USA.
BioData Min. 2023 Sep 4;16(1):25. doi: 10.1186/s13040-023-00342-0.
There are currently no univariate outlier detection algorithms that transform and model arbitrarily shaped distributions in order to remove their outliers. Some algorithms model skew, fewer model kurtosis, and none model bimodality or monotonicity. To overcome these challenges, we have implemented an algorithm for Skew and Tail-heaviness Adjusted Removal of Outliers (STAR_outliers) that robustly removes univariate outliers from distributions with many different shape profiles, including extreme skew, extreme kurtosis, bimodality, and monotonicity. We show that STAR_outliers removes simulated outliers with greater recall and precision than several general algorithms, and that it models the outlier bounds of real data distributions with greater accuracy.

Background

Reliably removing univariate outliers from arbitrarily shaped distributions is a difficult task. Incorrectly assuming unimodality or overestimating tail heaviness fails to remove outliers, while underestimating tail heaviness incorrectly removes regular data from the tails. Skew often produces one heavy tail and one light tail, and we show that several sophisticated outlier removal algorithms often fail to remove outliers from the light tail. Multivariate outlier detection algorithms have recently become popular, but having tested PyOD's multivariate outlier removal algorithms, we found them inadequate for univariate outlier removal: they usually do not accept univariate input, and they do not fit their distributions of outliership scores with a model on which an outlier threshold can be accurately established. Thus, there is a need for a flexible outlier removal algorithm that can model arbitrarily shaped univariate distributions.

Results

In order to effectively model arbitrarily shaped univariate distributions, we have combined several well-established algorithms into a new algorithm called STAR_outliers.
STAR_outliers removes more simulated true outliers and fewer non-outliers than several other univariate algorithms. These include several normality-assuming outlier removal methods, PyOD's isolation forest (IF) outlier removal algorithm (ACM Transactions on Knowledge Discovery from Data (TKDD) 6:3, 2012) with default settings, and an IQR-based algorithm by Verardi and Vermandele that removes outliers while accounting for skew and kurtosis (Verardi and Vermandele, Journal de la Société Française de Statistique 157:90-114, 2016). Because the IF algorithm's default model fit the outliership scores poorly, we also compared an IF variant that removes as many data points as STAR_outliers does, in order of decreasing outliership score. We further compared these algorithms on the publicly available 2018 National Health and Nutrition Examination Survey (NHANES) data by setting the outlier threshold to keep values falling within the central 99.3 percent of the fitted model's domain. We show that, on average, STAR_outliers removes significantly closer to 0.7 percent of values from these features than the other outlier removal methods.

Conclusions

STAR_outliers is an easily implemented Python package for removing outliers that outperforms multiple commonly used methods of univariate outlier removal.
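The light-tail failure mode described in the Background is easy to reproduce with the classic symmetric Tukey IQR fence, the baseline that skew-adjusting rules such as Verardi and Vermandele's modify. A minimal sketch (the lognormal data, planted outliers, and 1.5 multiplier are illustrative assumptions, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
# Illustrative data: a right-skewed lognormal bulk plus two planted
# light-tail outliers that are impossible under the lognormal
# (its support is strictly positive).
bulk = rng.lognormal(mean=0.0, sigma=0.5, size=10_000)
data = np.concatenate([bulk, [-0.1, -0.2]])

# Classic symmetric Tukey fences: [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr

kept = data[(data >= lo) & (data <= hi)]
# The symmetric fence is too loose on the light left tail (both planted
# outliers survive) and too tight on the heavy right tail (hundreds of
# regular lognormal draws are flagged).
print(f"fence=({lo:.2f}, {hi:.2f}), removed={data.size - kept.size}")
```

On this skewed sample, the lower fence drops well below zero, so both impossible negative values are kept, while a few hundred legitimate draws from the heavy right tail are discarded.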
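The top-k isolation forest comparison described in the Results can be sketched as follows. This uses scikit-learn's IsolationForest as a stand-in for PyOD's implementation, and assumes k (the number of points STAR_outliers would remove) is already known; the data and k are illustrative, not from the paper:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(1)
# Illustrative feature: a normal bulk with 50 planted high outliers.
x = np.concatenate([rng.normal(0.0, 1.0, 5_000), rng.uniform(6.0, 10.0, 50)])

# Most multivariate detectors require a 2-D array, so the univariate
# feature must be reshaped to a single column.
X = x.reshape(-1, 1)
iforest = IsolationForest(random_state=0).fit(X)
scores = -iforest.score_samples(X)  # higher score = more outlying

# Instead of trusting the default contamination threshold, remove exactly
# k points in order of decreasing outliership score, where k matches the
# count STAR_outliers removed (assumed known here).
k = 50
top_k = np.argsort(scores)[-k:]
removed = x[top_k]
print(f"removed {k}/{x.size} points ({100 * k / x.size:.2f}%)")
```

Ranking by score and cutting at a fixed k sidesteps the poorly fitting default threshold, which is the point of the paper's matched-count comparison.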