Shao Chen, Du Xusheng, Yu Jiong, Chen Jiaying
School of Information Science and Engineering, Xinjiang University, Urumqi 830046, China.
Entropy (Basel). 2022 Apr 27;24(5):611. doi: 10.3390/e24050611.
Outlier detection is an important research direction in the field of data mining. In the Isolation Forest algorithm, randomly dividing the features of the data set leads to unstable detection results and low efficiency. To address this problem, an algorithm named CIIF (Cluster-based Improved Isolation Forest), which combines clustering with Isolation Forest, is proposed. CIIF first clusters the data set with the k-means method, selects a specific cluster based on the clustering results to construct a selection matrix, and implements the algorithm's selection mechanism through that matrix; it then builds multiple isolation trees. Finally, an outlier score is calculated for each sample from its average search length across the isolation trees, and the top-n objects with the highest outlier scores are regarded as outliers. In comparative experiments against six algorithms on eleven real data sets, the results show that the CIIF algorithm has better performance. Compared with the Isolation Forest algorithm, the average AUC (area under the ROC curve) of the proposed CIIF algorithm is improved by 7%.
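The final scoring step described in the abstract (average search length across isolation trees, then top-n ranking) can be illustrated with a minimal sketch. It assumes the standard Isolation Forest score, s = 2^(-E[h(x)] / c(n)), and uses plain random feature splits; it does not implement CIIF's k-means-based selection matrix, whose construction the abstract does not specify. All function names, parameters, and defaults here are hypothetical choices made for illustration.

```python
import math
import random

EULER_GAMMA = 0.5772156649  # Euler-Mascheroni constant


def c_factor(n):
    """Average path length of an unsuccessful binary-search-tree lookup
    over n points; used to normalise isolation-tree path lengths."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2.0 * (math.log(n - 1) + EULER_GAMMA) - 2.0 * (n - 1) / n


def build_tree(rows, depth, max_depth, rng):
    """Isolate rows with random axis-aligned splits.
    ASSUMPTION: features are picked uniformly at random (plain Isolation
    Forest), not through CIIF's selection matrix."""
    if depth >= max_depth or len(rows) <= 1:
        return {"size": len(rows)}
    feature = rng.randrange(len(rows[0]))
    lo = min(r[feature] for r in rows)
    hi = max(r[feature] for r in rows)
    if lo == hi:  # nothing to split on this feature
        return {"size": len(rows)}
    split = rng.uniform(lo, hi)
    left = [r for r in rows if r[feature] < split]
    right = [r for r in rows if r[feature] >= split]
    return {
        "feature": feature,
        "split": split,
        "left": build_tree(left, depth + 1, max_depth, rng),
        "right": build_tree(right, depth + 1, max_depth, rng),
    }


def path_length(tree, x, depth=0):
    """Search length of x in one tree, adjusted for unsplit leaf size."""
    if "size" in tree:
        return depth + c_factor(tree["size"])
    child = tree["left"] if x[tree["feature"]] < tree["split"] else tree["right"]
    return path_length(child, x, depth + 1)


def outlier_scores(data, n_trees=100, sample_size=256, seed=0):
    """Score each sample from its average path length over all trees.
    Expects at least two samples; scores closer to 1 are more anomalous."""
    rng = random.Random(seed)
    sample_size = min(sample_size, len(data))
    max_depth = math.ceil(math.log2(sample_size))
    trees = [
        build_tree(rng.sample(data, sample_size), 0, max_depth, rng)
        for _ in range(n_trees)
    ]
    norm = c_factor(sample_size)
    scores = []
    for x in data:
        mean_path = sum(path_length(t, x) for t in trees) / n_trees
        scores.append(2.0 ** (-mean_path / norm))
    return scores


def top_n_outliers(data, n, **kwargs):
    """Return the indices of the n samples with the highest scores."""
    scores = outlier_scores(data, **kwargs)
    ranked = sorted(range(len(data)), key=lambda i: scores[i], reverse=True)
    return ranked[:n]
```

As a usage example, `top_n_outliers(points, 5)` takes a list of equal-length numeric tuples and returns the indices of the five highest-scoring samples. Fixing `seed` makes a run repeatable, but it does not remove the sensitivity to random feature division that the abstract identifies as the motivation for CIIF.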