KnowSim: A Document Similarity Measure on Structured Heterogeneous Information Networks.

Author information

Wang Chenguang, Song Yangqiu, Li Haoran, Zhang Ming, Han Jiawei

Affiliations

School of EECS, Peking University.

Department of Computer Science, University of Illinois at Urbana-Champaign.

Publication information

Proc IEEE Int Conf Data Min. 2015 Nov;2015:1015-1020. doi: 10.1109/ICDM.2015.131.

DOI:10.1109/ICDM.2015.131

PMID:27034626

Original link: https://pmc.ncbi.nlm.nih.gov/articles/PMC4811603/

Abstract

As a fundamental task, document similarity measurement has a broad impact on document-based classification, clustering, and ranking. Traditional approaches represent documents as bags of words and compute document similarities using measures such as cosine, Jaccard, and Dice. However, entity phrases, rather than single words, can be critical for evaluating document relatedness. Moreover, the types of entities and the links between entities and words are also informative. We propose a method to represent a document as a typed heterogeneous information network (HIN), in which the entities and relations are annotated with types. Multiple documents can be linked by the words and entities in the HIN. Consequently, we convert the document similarity problem into a graph distance problem. Intuitively, there can be multiple paths between a pair of documents. We propose to use the meta-paths defined in the HIN to compute the distance between documents. Instead of burdening users with defining meaningful meta-paths, we propose an automatic method to rank the meta-paths. Given the meta-paths and their associated ranking scores, we propose an HIN-based similarity measure, KnowSim, to compute document similarities. Using Freebase, a well-known world knowledge base, to conduct semantic parsing and construct the HIN for documents, our experiments on the 20Newsgroups and RCV1 datasets show that KnowSim generates impressively high-quality document clusterings.
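The bag-of-words baselines named in the abstract (cosine, Jaccard, and Dice) can be illustrated with a minimal sketch. This is not the KnowSim measure itself, whose formula the abstract does not give; the helper names (`bag_of_words`, `cosine`, `jaccard`, `dice`), the whitespace tokenizer, and the two example sentences are assumptions made for illustration only.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Assumed tokenizer: lowercase and split on whitespace."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def jaccard(a, b):
    """Jaccard coefficient on the sets of distinct words."""
    sa, sb = set(a), set(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def dice(a, b):
    """Dice coefficient on the sets of distinct words."""
    sa, sb = set(a), set(b)
    total = len(sa) + len(sb)
    return 2 * len(sa & sb) / total if total else 0.0


# Hypothetical example documents, not taken from the paper's datasets.
doc1 = bag_of_words("obama met the senate in washington")
doc2 = bag_of_words("the senate met in new york")

print(cosine(doc1, doc2), jaccard(doc1, doc2), dice(doc1, doc2))
```

Note that the whitespace tokenizer splits "new york" into two unrelated tokens, which is the kind of loss of entity-phrase information that motivates the typed HIN representation described in the abstract.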
