KnowSim: A Document Similarity Measure on Structured Heterogeneous Information Networks.

Authors

Wang Chenguang, Song Yangqiu, Li Haoran, Zhang Ming, Han Jiawei

Affiliations

School of EECS, Peking University.

Department of Computer Science, University of Illinois at Urbana-Champaign.

Publication information

Proc IEEE Int Conf Data Min. 2015 Nov;2015:1015-1020. doi: 10.1109/ICDM.2015.131.

Abstract

As a fundamental task, document similarity measurement has a broad impact on document-based classification, clustering, and ranking. Traditional approaches represent documents as bags of words and compute document similarities using measures such as cosine, Jaccard, and Dice. However, entity phrases, rather than single words, can be critical for evaluating document relatedness. Moreover, the types of entities and the links between entities and words are also informative. We propose a method to represent a document as a typed heterogeneous information network (HIN), in which the entities and relations are annotated with types. Multiple documents can be linked by the words and entities in the HIN. Consequently, we convert the document similarity problem into a graph distance problem. Intuitively, there can be multiple paths between a pair of documents. We propose to use the meta-paths defined in the HIN to compute the distance between documents. Instead of burdening users with defining meaningful meta-paths, we propose an automatic method to rank the meta-paths. Given the meta-paths and their associated ranking scores, we propose an HIN-based similarity measure, KnowSim, to compute document similarities. Using Freebase, a well-known world knowledge base, to conduct semantic parsing and construct the HIN for documents, our experiments on the 20Newsgroups and RCV1 datasets show that KnowSim produces high-quality document clustering.
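The idea of scoring document pairs by weighted meta-path counts can be illustrated with a small sketch. The abstract does not state the KnowSim formula, so the code below assumes a PathSim-style normalization (twice the weighted path count between two documents, divided by the sum of their weighted self-path counts) and only considers meta-paths of the form Document-Type-Document. The toy documents, the entity types, the weights, and the function names `path_count` and `weighted_metapath_similarity` are all hypothetical and are not taken from the paper.

```python
from collections import Counter

# Hypothetical toy corpus: each document lists the typed nodes it links to.
docs = {
    "d1": [("Person", "Obama"), ("Location", "Hawaii"), ("Word", "election")],
    "d2": [("Person", "Obama"), ("Word", "election"), ("Word", "vote")],
    "d3": [("Location", "Hawaii"), ("Word", "beach")],
}

# Assumed weights for the meta-paths Document-<type>-Document.
# In the paper these come from an automatic ranking step; here they are made up.
meta_path_weights = {"Person": 0.5, "Location": 0.3, "Word": 0.2}


def path_count(a, b, node_type):
    """Count Document-<node_type>-Document path instances between a and b."""
    nodes_a = Counter(name for t, name in docs[a] if t == node_type)
    nodes_b = Counter(name for t, name in docs[b] if t == node_type)
    return sum(nodes_a[name] * nodes_b[name] for name in nodes_a)


def weighted_metapath_similarity(a, b):
    """PathSim-style similarity summed over weighted meta-paths (an assumption)."""
    between = sum(w * path_count(a, b, t) for t, w in meta_path_weights.items())
    self_a = sum(w * path_count(a, a, t) for t, w in meta_path_weights.items())
    self_b = sum(w * path_count(b, b, t) for t, w in meta_path_weights.items())
    denominator = self_a + self_b
    return 2 * between / denominator if denominator else 0.0


if __name__ == "__main__":
    for pair in [("d1", "d2"), ("d1", "d3"), ("d2", "d3")]:
        print(pair, round(weighted_metapath_similarity(*pair), 3))
```

In this toy setup, d1 and d2 share a person and a word, so they score higher than d1 and d3, which share only a location; d2 and d3 share nothing and score zero. The normalization keeps scores in [0, 1] and gives every document a similarity of 1 with itself.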

Similar articles

1
KnowSim: A Document Similarity Measure on Structured Heterogeneous Information Networks.
Proc IEEE Int Conf Data Min. 2015 Nov;2015:1015-1020. doi: 10.1109/ICDM.2015.131.
4
Inductive Meta-Path Learning for Schema-Complex Heterogeneous Information Networks.
IEEE Trans Pattern Anal Mach Intell. 2024 Dec;46(12):10196-10209. doi: 10.1109/TPAMI.2024.3435055. Epub 2024 Nov 6.
5
Bridging the gap: Incorporating a semantic similarity measure for effectively mapping PubMed queries to documents.
J Biomed Inform. 2017 Nov;75:122-127. doi: 10.1016/j.jbi.2017.09.014. Epub 2017 Oct 3.
7
Search and Graph Database Technologies for Biomedical Semantic Indexing: Experimental Analysis.
JMIR Med Inform. 2017 Dec 1;5(4):e48. doi: 10.2196/medinform.7059.
8
Clustering on heterogeneous IoT information network based on meta path.
Sci Prog. 2024 Apr-Jun;107(2):368504241257389. doi: 10.1177/00368504241257389.
9
Ontology-based structured cosine similarity in document summarization: with applications to mobile audio-based knowledge management.
IEEE Trans Syst Man Cybern B Cybern. 2005 Oct;35(5):1028-40. doi: 10.1109/tsmcb.2005.850153.
10
Active learning for ontological event extraction incorporating named entity recognition and unknown word handling.
J Biomed Semantics. 2016 Apr 27;7:22. doi: 10.1186/s13326-016-0059-z. eCollection 2016.
