WEClustering: word embeddings based text clustering technique for large datasets.

Authors

Mehta Vivek, Bawa Seema, Singh Jasmeet

Affiliation

Computer Science and Engineering Department, Thapar Institute of Engineering and Technology, Patiala, Punjab 147001 India.

Publication information

Complex Intell Systems. 2021;7(6):3211-3224. doi: 10.1007/s40747-021-00512-9. Epub 2021 Sep 7.

Abstract

A massive amount of textual data now exists in digital repositories in the form of research articles, news articles, reviews, Wikipedia articles, books, and similar documents. Text clustering is a fundamental data mining technique for categorization, topic extraction, and information retrieval. Textual datasets, especially those that contain a large number of documents, are sparse and high-dimensional. Hence, traditional clustering techniques such as K-means, agglomerative clustering, and DBSCAN do not perform well on them. In this paper, a clustering technique especially suited to large text datasets is proposed that overcomes these limitations. The proposed technique, named WEClustering, is based on word embeddings derived from a recent deep learning model, "Bidirectional Encoder Representations from Transformers" (BERT). It deals with the problem of high dimensionality effectively, and hence forms more accurate clusters. The technique is validated on several datasets of varying sizes, and its performance is compared with other widely used and state-of-the-art clustering techniques. The experimental comparison shows that the proposed technique gives a significant improvement over the other techniques as measured by metrics such as Purity and Adjusted Rand Index.
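The abstract reports results in terms of Purity and Adjusted Rand Index without defining them. As a minimal sketch, assuming the standard textbook definitions of both metrics (the function names `purity` and `adjusted_rand_index` and the toy labels are illustrative and do not come from the paper), they could be computed as follows:

```python
from collections import Counter
from math import comb


def purity(true_labels, cluster_ids):
    """Fraction of documents that belong to the majority class of their cluster."""
    if not true_labels or len(true_labels) != len(cluster_ids):
        raise ValueError("label sequences must be non-empty and of equal length")
    by_cluster = {}
    for label, cluster in zip(true_labels, cluster_ids):
        by_cluster.setdefault(cluster, Counter())[label] += 1
    majority_total = sum(max(counts.values()) for counts in by_cluster.values())
    return majority_total / len(true_labels)


def adjusted_rand_index(true_labels, cluster_ids):
    """Pair-counting agreement between two partitions, corrected for chance."""
    if len(true_labels) != len(cluster_ids):
        raise ValueError("label sequences must be of equal length")
    n = len(true_labels)
    if n < 2:
        return 1.0  # no pairs to disagree on
    # Count pairs of documents that share a (class, cluster) cell, a class, or a cluster.
    same_both = sum(comb(v, 2) for v in Counter(zip(true_labels, cluster_ids)).values())
    same_class = sum(comb(v, 2) for v in Counter(true_labels).values())
    same_cluster = sum(comb(v, 2) for v in Counter(cluster_ids).values())
    expected = same_class * same_cluster / comb(n, 2)
    maximum = (same_class + same_cluster) / 2
    if maximum == expected:
        return 1.0  # degenerate case: both partitions are trivial
    return (same_both - expected) / (maximum - expected)


# Toy example (hypothetical labels): one "sport" document lands in the "tech" cluster.
truth = ["sport", "sport", "sport", "tech", "tech", "tech"]
predicted = [0, 0, 1, 1, 1, 1]
print(round(purity(truth, predicted), 3))               # 0.833
print(round(adjusted_rand_index(truth, predicted), 3))  # 0.324
```

Purity rewards clusters dominated by a single class but can be inflated by producing many small clusters, whereas the Adjusted Rand Index is close to 0 for random assignments and equals 1 only for identical partitions, which is presumably why the two are reported together.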

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e660/8421191/fbaec2dfae30/40747_2021_512_Fig1_HTML.jpg
