Cluster-Based Improved Isolation Forest.

Authors

Shao Chen, Du Xusheng, Yu Jiong, Chen Jiaying

Affiliation

School of Information Science and Engineering, Xinjiang University, Urumqi 830046, China.

Publication information

Entropy (Basel). 2022 Apr 27;24(5):611. doi: 10.3390/e24050611.

DOI:10.3390/e24050611
PMID:35626495
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC9141139/
Abstract

Outlier detection is an important research direction in data mining. In the Isolation Forest algorithm, randomly dividing the features of the data set leads to unstable detection results and low efficiency. To address this problem, an algorithm named CIIF (Cluster-based Improved Isolation Forest), which combines clustering with Isolation Forest, is proposed. CIIF first clusters the data set with the k-means method, selects a specific cluster based on the clustering results to construct a selection matrix, and implements the algorithm's selection mechanism through that matrix; it then builds multiple isolation trees. Finally, outlier scores are calculated from the average search length of each sample across the isolation trees, and the Top-n objects with the highest outlier scores are regarded as outliers. In comparative experiments against six algorithms on eleven real data sets, the CIIF algorithm shows better performance. Compared to the Isolation Forest algorithm, the average AUC (area under the ROC curve) of the proposed CIIF algorithm is improved by 7%.
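The scoring pipeline the abstract describes (build isolation trees, average each sample's search length, rank the Top-n, evaluate with AUC) can be illustrated with a minimal pure-Python sketch of the baseline Isolation Forest. This is not the CIIF method: the abstract does not specify how the k-means selection matrix is built, so that step is omitted here. The score normalization follows the commonly used Isolation Forest formulation, and all function names and default parameters below are hypothetical choices made for illustration.

```python
import math
import random

EULER_GAMMA = 0.5772156649


def c_factor(n):
    """Average path length of an unsuccessful binary-search-tree lookup over n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2.0 * (math.log(n - 1) + EULER_GAMMA) - 2.0 * (n - 1) / n


def build_tree(points, depth, max_depth, rng):
    """Isolate points with random axis-parallel splits (baseline, purely random choice)."""
    if depth >= max_depth or len(points) <= 1:
        return {"size": len(points)}
    dim = rng.randrange(len(points[0]))
    lo = min(p[dim] for p in points)
    hi = max(p[dim] for p in points)
    if lo == hi:  # cannot split on a constant feature
        return {"size": len(points)}
    split = rng.uniform(lo, hi)
    left = [p for p in points if p[dim] < split]
    right = [p for p in points if p[dim] >= split]
    return {
        "dim": dim,
        "split": split,
        "left": build_tree(left, depth + 1, max_depth, rng),
        "right": build_tree(right, depth + 1, max_depth, rng),
    }


def path_length(x, node):
    """Search length of x in one tree, adjusted for unsplit points left in the leaf."""
    depth = 0
    while "dim" in node:
        node = node["left"] if x[node["dim"]] < node["split"] else node["right"]
        depth += 1
    return depth + c_factor(node["size"])


def anomaly_scores(data, n_trees=100, sample_size=256, seed=0):
    """Score each sample as 2 ** (-mean search length / c(sample size))."""
    if len(data) < 2:
        raise ValueError("need at least two samples")
    rng = random.Random(seed)
    psi = min(sample_size, len(data))
    max_depth = math.ceil(math.log2(psi))
    trees = [build_tree(rng.sample(data, psi), 0, max_depth, rng)
             for _ in range(n_trees)]
    norm = c_factor(psi)
    return [2.0 ** (-sum(path_length(x, t) for t in trees) / n_trees / norm)
            for x in data]


def top_n_outliers(scores, n):
    """Indices of the n samples with the highest outlier scores."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:n]


def roc_auc(scores, labels):
    """AUC as the probability that a true outlier outscores a normal sample (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))
```

Calling `anomaly_scores(data)` on a list of equal-length numeric tuples returns one score per sample; `top_n_outliers(scores, n)` then gives the Top-n candidates, and `roc_auc(scores, labels)` evaluates them against 0/1 ground-truth labels. A cluster-guided variant such as CIIF would presumably replace the purely random split choice in `build_tree`, which is the source of instability the abstract points to.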


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/22a0/9141139/d195a452ba3f/entropy-24-00611-g001.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/22a0/9141139/32eec646b88e/entropy-24-00611-g002.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/22a0/9141139/3d3d37123820/entropy-24-00611-g003a.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/22a0/9141139/0b0f97562268/entropy-24-00611-g004.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/22a0/9141139/bd787040f800/entropy-24-00611-g005.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/22a0/9141139/6ce045198890/entropy-24-00611-g006.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/22a0/9141139/f1b23ed7d3b6/entropy-24-00611-g007.jpg

Similar articles

1
Cluster-Based Improved Isolation Forest.
Entropy (Basel). 2022 Apr 27;24(5):611. doi: 10.3390/e24050611.
2
How the Outliers Influence the Quality of Clustering?
Entropy (Basel). 2022 Jun 30;24(7):917. doi: 10.3390/e24070917.
3
A Novel Model on Reinforce K-Means Using Location Division Model and Outlier of Initial Value for Lowering Data Cost.
Entropy (Basel). 2020 Aug 17;22(8):902. doi: 10.3390/e22080902.
4
A method for detecting abnormal behavior of ships based on multi-dimensional density distance and an abnormal isolation mechanism.
Math Biosci Eng. 2023 Jun 20;20(8):13921-13946. doi: 10.3934/mbe.2023620.
5
STAR_outliers: a python package that separates univariate outliers from non-normal distributions.
BioData Min. 2023 Sep 4;16(1):25. doi: 10.1186/s13040-023-00342-0.
6
Augmented Intelligence for Clinical Discovery in Hypertensive Disorders of Pregnancy Using Outlier Analysis.
Cureus. 2023 Mar 30;15(3):e36909. doi: 10.7759/cureus.36909. eCollection 2023 Mar.
7
Entropy-based grid approach for handling outliers: a case study to environmental monitoring data.
Environ Sci Pollut Res Int. 2023 Dec;30(60):125138-125157. doi: 10.1007/s11356-023-26780-1. Epub 2023 Jun 12.
8
Research and Application of Clustering Algorithm for Text Big Data.
Comput Intell Neurosci. 2022 Jun 8;2022:7042778. doi: 10.1155/2022/7042778. eCollection 2022.
9
FilterK: A new outlier detection method for k-means clustering of physical activity.
J Biomed Inform. 2020 Apr;104:103397. doi: 10.1016/j.jbi.2020.103397. Epub 2020 Feb 26.
10
A Weakly Supervised Gas-Path Anomaly Detection Method for Civil Aero-Engines Based on Mapping Relationship Mining of Gas-Path Parameters and Improved Density Peak Clustering.
Sensors (Basel). 2021 Jul 1;21(13):4526. doi: 10.3390/s21134526.

Cited by

1
Gate-Level Circuit Partitioning Algorithm Based on Clustering and an Improved Genetic Algorithm.
Entropy (Basel). 2023 Mar 31;25(4):597. doi: 10.3390/e25040597.
2
Power Disturbance Monitoring through Techniques for Novelty Detection on Wind Power and Photovoltaic Generation.
Sensors (Basel). 2023 Mar 7;23(6):2908. doi: 10.3390/s23062908.
3
Grid-Based Clustering Using Boundary Detection.
Entropy (Basel). 2022 Nov 4;24(11):1606. doi: 10.3390/e24111606.
