Kuck Jonathan, Zhuang Honglei, Yan Xifeng, Cam Hasan, Han Jiawei
Department of Computer Science, University of Illinois at Urbana-Champaign.
Computer Science Department, University of California at Santa Barbara.
Adv Database Technol. 2015 Mar;2015:325-336. doi: 10.5441/002/edbt.2015.29.
Outlier or anomaly detection in large data sets is a fundamental task in data science, with broad applications. However, in real high-dimensional data sets, most outliers are hidden in particular combinations of dimensions and are relative to a user's search space and interest. It is often more effective to let users specify outlier queries flexibly and have the system process such mining queries efficiently. In this study, we introduce the concept of query-based outliers in heterogeneous information networks, design a query language that lets users specify such queries flexibly, define a suitable outlier measure for heterogeneous networks, and study how to process outlier queries efficiently in large data sets. Our experiments on real data sets show that, with this methodology, interesting outliers can be defined and uncovered flexibly and effectively in large heterogeneous networks.