Exploration of Outliers in If-Then Rule-Based Knowledge Bases.

Author information

Nowak-Brzezińska Agnieszka, Horyń Czesław

Affiliation

Institute of Computer Science, Faculty of Science and Technology, University of Silesia, Bankowa 12, 40-007 Katowice, Poland.

Publication information

Entropy (Basel). 2020 Sep 29;22(10):1096. doi: 10.3390/e22101096.

DOI:10.3390/e22101096

PMID:33286864

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC7597194/

Abstract

The article presents methods for both clustering and outlier detection in complex data, such as rule-based knowledge bases. Two things distinguish this work from others: first, the application of clustering algorithms to rules in domain knowledge bases, and second, the use of outlier detection algorithms to detect unusual rules in knowledge bases. The aim of the paper is to analyze the use of four algorithms for outlier detection in rule-based knowledge bases: Local Outlier Factor (LOF), Connectivity-based Outlier Factor (COF), K-MEANS, and SMALLCLUSTERS. Outlier mining is an important topic today. Outliers in rules are unusual rules, which are rare in comparison to the others and should be explored by the domain expert as soon as possible. In the research, the authors use the outlier detection methods to find a given number of outliers in rules (1%, 5%, 10%), while in small groups the number of outliers covers no more than 5% of the rule cluster. Subsequently, the authors compute seven quality indices, first for all rules and then after removing the selected outliers, and analyze which of these indices show an improvement in the quality of the rule clusters. In the experimental stage, the authors use six different knowledge bases. The best results (cluster quality improved most often) are achieved for two outlier detection algorithms, LOF and COF.
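To make the idea of flagging a fixed share of rules as outliers concrete, the following is a minimal sketch of LOF scoring in plain Python. It is not the authors' implementation. It assumes, hypothetically, that each rule has already been encoded as a numeric vector, that Euclidean distance is a suitable dissimilarity, and that all vectors are distinct. The names `lof_scores`, `top_outliers`, and the parameter `k` are illustrative only.

```python
import math

def knn(points, i, k):
    """Return the k nearest neighbours of points[i] as (distance, index) pairs."""
    dists = sorted((math.dist(points[i], q), j)
                   for j, q in enumerate(points) if j != i)
    return dists[:k]

def lof_scores(points, k=3):
    """Local Outlier Factor for every point (assumes distinct points)."""
    n = len(points)
    neigh = [knn(points, i, k) for i in range(n)]
    # k-distance: distance to the k-th nearest neighbour
    k_dist = [nb[-1][0] for nb in neigh]
    # local reachability density: inverse of the mean reachability distance
    lrd = []
    for i in range(n):
        reach = [max(k_dist[j], d) for d, j in neigh[i]]
        lrd.append(len(reach) / sum(reach))
    # LOF: mean density of the neighbours relative to the point's own density
    return [sum(lrd[j] for _, j in neigh[i]) / (len(neigh[i]) * lrd[i])
            for i in range(n)]

def top_outliers(points, fraction, k=3):
    """Indices of the given fraction of points with the highest LOF."""
    scores = lof_scores(points, k)
    count = max(1, round(fraction * len(points)))
    ranked = sorted(range(len(points)), key=lambda i: scores[i], reverse=True)
    return ranked[:count]

# Hypothetical toy data: nine "rules" on a 3x3 grid plus one distant rule.
rules = [(x, y) for x in range(3) for y in range(3)] + [(10, 10)]
print(top_outliers(rules, 0.10))  # flag 10% of the rules as outliers
```

Points inside a dense group receive a score close to 1, while an isolated point receives a much larger score, so ranking by score and keeping the top 1%, 5%, or 10% mirrors the selection scheme described in the abstract.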
