STAR_outliers: a python package that separates univariate outliers from non-normal distributions.

Authors

Gregg John T, Moore Jason H

Affiliations

Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania, Philadelphia, PA, USA.

Department of Computational Biomedicine, Cedars-Sinai Medical Center, Los Angeles, CA, 90069, USA.

Publication details

BioData Min. 2023 Sep 4;16(1):25. doi: 10.1186/s13040-023-00342-0.

DOI: 10.1186/s13040-023-00342-0
PMID: 37667378
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC10476292/

Abstract

There are not currently any univariate outlier detection algorithms that transform and model arbitrarily shaped distributions to remove univariate outliers. Some algorithms model skew, even fewer model kurtosis, and none of them model bimodality and monotonicity. To overcome these challenges, we have implemented an algorithm for Skew and Tail-heaviness Adjusted Removal of Outliers (STAR_outliers) that robustly removes univariate outliers from distributions with many different shape profiles, including extreme skew, extreme kurtosis, bimodality, and monotonicity. We show that STAR_outliers removes simulated outliers with greater recall and precision than several general algorithms, and it also models the outlier bounds of real data distributions with greater accuracy.

Background

Reliably removing univariate outliers from arbitrarily shaped distributions is a difficult task. Incorrectly assuming unimodality or overestimating tail heaviness fails to remove outliers, while underestimating tail heaviness incorrectly removes regular data from the tails. Skew often produces one heavy tail and one light tail, and we show that several sophisticated outlier removal algorithms often fail to remove outliers from the light tail. Multivariate outlier detection algorithms have recently become popular, but having tested PyOD's multivariate outlier removal algorithms, we found them to be inadequate for univariate outlier removal. They usually do not allow for univariate input, and they do not fit their distributions of outliership scores with a model on which an outlier threshold can be accurately established. Thus, there is a need for a flexible outlier removal algorithm that can model arbitrarily shaped univariate distributions.

Results

In order to effectively model arbitrarily shaped univariate distributions, we have combined several well-established algorithms into a new algorithm called STAR_outliers. STAR_outliers removes more simulated true outliers and fewer non-outliers than several other univariate algorithms. These include several normality-assuming outlier removal methods, PyOD's isolation forest (IF) outlier removal algorithm (ACM Transactions on Knowledge Discovery from Data (TKDD) 6:3, 2012) with default settings, and an IQR based algorithm by Verardi and Vermandele that removes outliers while accounting for skew and kurtosis (Verardi and Vermandele, Journal de la Société Française de Statistique 157:90-114, 2016). Since the IF algorithm's default model poorly fit the outliership scores, we also compared the isolation forest algorithm with a model that entails removing as many datapoints as STAR_outliers does in order of decreasing outliership scores. We also compared these algorithms on the publicly available 2018 National Health and Nutrition Examination Survey (NHANES) data by setting the outlier threshold to keep values falling within the main 99.3 percent of the fitted model's domain. We show that our STAR_outliers algorithm removes significantly closer to 0.7 percent of values from these features than other outlier removal methods on average.

Conclusions

STAR_outliers is an easily implemented python package for removing outliers that outperforms multiple commonly used methods of univariate outlier removal.
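
The skew problem raised in the abstract can be illustrated with a small calculation. The sketch below is not the STAR_outliers algorithm and does not use its API. It applies the classic Tukey rule (fences 1.5 interquartile ranges beyond the quartiles), a normality-motivated baseline, to two theoretical distributions. Linking the abstract's 99.3 percent target to this rule is an assumption made here for illustration: the abstract does not say where the figure comes from, though it matches the coverage of these fences under a normal distribution. The helper `tukey_fences` and all variable names are hypothetical.

```python
from math import exp, log
from statistics import NormalDist

STD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def tukey_fences(q1, q3, k=1.5):
    """Return (lower, upper) fences k interquartile ranges beyond the quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Case 1: standard normal distribution.
z25, z75 = STD_NORMAL.inv_cdf(0.25), STD_NORMAL.inv_cdf(0.75)
lo_n, hi_n = tukey_fences(z25, z75)
normal_coverage = STD_NORMAL.cdf(hi_n) - STD_NORMAL.cdf(lo_n)

# Case 2: lognormal distribution X = exp(Z), which is strongly right-skewed.
# Quantiles of X are exp() of the quantiles of Z because exp is monotonic.
lo_s, hi_s = tukey_fences(exp(z25), exp(z75))
# P(X > hi_s) = P(Z > log(hi_s)). X is always positive, so a lower fence at
# or below zero flags nothing in the light (left) tail.
heavy_tail_flagged = 1.0 - STD_NORMAL.cdf(log(hi_s))
light_tail_flagged = STD_NORMAL.cdf(log(lo_s)) if lo_s > 0 else 0.0

print(f"normal: fraction kept inside fences = {normal_coverage:.4f}")
print(f"lognormal: fraction flagged, heavy tail = {heavy_tail_flagged:.4f}")
print(f"lognormal: fraction flagged, light tail = {light_tail_flagged:.4f}")
```

Under normality the fences keep about 99.3 percent of the distribution. On the lognormal example the same rule flags several percent of ordinary values in the heavy tail, while the lower fence falls below zero and so cannot flag anything in the light tail. This is the failure mode the Background attributes to methods that do not model skew.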

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/26a1/10476292/9e5f4428170d/13040_2023_342_Fig1_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/26a1/10476292/e8ef878c5736/13040_2023_342_Fig2_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/26a1/10476292/2659c1c744f7/13040_2023_342_Fig3_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/26a1/10476292/0296026f1514/13040_2023_342_Fig4_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/26a1/10476292/a6ec8191333f/13040_2023_342_Fig5_HTML.jpg