An index-based algorithm for fast on-line query processing of latent semantic analysis.

Author information

Zhang Mingxi, Li Pohan, Wang Wei

Affiliations

College of Communication and Art Design, University of Shanghai for Science and Technology, Shanghai, China.

School of Computer Science, Fudan University, Shanghai, China.

Publication information

PLoS One. 2017 May 16;12(5):e0177523. doi: 10.1371/journal.pone.0177523. eCollection 2017.

Abstract

Latent Semantic Analysis (LSA) is widely used to find documents that are semantically similar to a keyword query. Although LSA yields promising similarity results, existing LSA algorithms perform many unnecessary operations in similarity computation and candidate checking during on-line query processing. This is expensive in time and prevents efficient response to query requests, especially when the dataset is large. In this paper, we study the efficiency of on-line query processing for LSA, with the goal of efficiently searching for documents similar to a given query. We rewrite the LSA similarity equation in terms of an intermediate value called partial similarity, which is stored in a purpose-built index called the partial index. To reduce the search space, we give an approximate form of the similarity equation and then develop an efficient algorithm for building the partial index that skips partial similarities lower than a given threshold θ. Based on the partial index, we develop an efficient algorithm called ILSA to support fast on-line query processing. The given query is transformed into a pseudo-document vector, and the similarities between the query and candidate documents are computed by accumulating the partial similarities obtained from the index nodes that correspond to non-zero entries in the pseudo-document vector. Compared with the LSA algorithm, ILSA reduces the time cost of on-line query processing by pruning unpromising candidate documents and skipping operations that contribute little to the similarity scores. Extensive experiments comparing ILSA with LSA demonstrate the efficiency and effectiveness of the proposed algorithm.
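
The abstract does not give the similarity equation itself, so the following is only a minimal sketch of the general idea, not the authors' ILSA implementation. It assumes that the partial similarity between a term and a document is the dot product of the term's folded-in latent vector (a row of U_k Σ_k^{-1}) with the document's latent vector (a row of V_k), and that the query is a sparse map from term index to weight. The names `build_partial_index` and `query` are hypothetical.

```python
import numpy as np

def build_partial_index(A, k, theta):
    """Build a term -> [(doc, partial_similarity)] index from a terms x docs matrix A.

    Assumption (not from the paper): partial similarity of term t and document d
    is (U_k Sigma_k^{-1})[t] . V_k[d]. Entries lower than theta are skipped.
    """
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    term_vecs = U[:, :k] / s[:k]        # fold each term in as a one-term query
    partial = term_vecs @ Vt[:k, :]     # shape: (terms, docs)
    index = {}
    for t in range(partial.shape[0]):
        kept = np.nonzero(partial[t] >= theta)[0]
        index[t] = [(int(d), float(partial[t, d])) for d in kept]
    return index

def query(index, q_weights, top_n):
    """Score documents by accumulating partial similarities for non-zero query terms."""
    scores = {}
    for t, w in q_weights.items():      # only non-zero query entries are visited
        for d, p in index.get(t, []):   # documents pruned at build time never appear
            scores[d] = scores.get(d, 0.0) + w * p
    return sorted(scores.items(), key=lambda kv: -kv[1])[:top_n]
```

Under these assumptions, the accumulated score equals the unnormalized dot product between the folded-in query and each document vector when nothing is pruned; raising `theta` trades accuracy for a smaller index and fewer accumulations per query.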

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/3083/5433746/38721f320b52/pone.0177523.g001.jpg
