The performance of BERT as data representation of text clustering.

Author information

Subakti Alvin, Murfi Hendri, Hariadi Nora

Affiliation

Department of Mathematics, Universitas Indonesia, Depok, 16424 Indonesia.

Publication information

J Big Data. 2022;9(1):15. doi: 10.1186/s40537-022-00564-9. Epub 2022 Feb 8.

Abstract

Text clustering is the task of grouping a set of texts so that texts in the same group are more similar to one another than to texts in other groups. Grouping text manually requires a significant amount of time and labor, so automation with machine learning is necessary. One of the most frequently used methods for representing textual data is Term Frequency Inverse Document Frequency (TFIDF). However, TFIDF cannot account for the position and context of a word in a sentence. The Bidirectional Encoder Representations from Transformers (BERT) model can produce text representations that incorporate the position and context of a word in a sentence. This research analyzes the performance of the BERT model as a data representation for text. Various feature extraction and normalization methods are also applied to the BERT representation. To examine the performance of BERT, we use four clustering algorithms: k-means clustering, eigenspace-based fuzzy c-means, deep embedded clustering, and improved deep embedded clustering. Our simulations show that BERT outperforms the TFIDF method in 28 out of 36 metrics. Furthermore, different feature extraction and normalization methods produce varied performance, so the choice of feature extraction and normalization must be adapted to the text clustering algorithm used.
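
The abstract's point that TFIDF cannot consider word position can be illustrated with a small sketch. The code below is not the authors' implementation: the function name `tfidf_vectors`, the whitespace tokenizer, the smoothed IDF formula, and the three example sentences are all assumptions chosen for illustration. It shows that two sentences containing the same words in a different order receive identical TFIDF vectors, even though their meanings differ.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return the sorted vocabulary and one TF-IDF vector per document.

    Hypothetical helper for illustration only; tokenization is a plain
    lowercase whitespace split.
    """
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        # Term frequency times a smoothed IDF (an assumed variant):
        # log((1 + N) / (1 + df)) + 1.
        vec = [
            (counts[t] / len(toks)) * (math.log((1 + n_docs) / (1 + df[t])) + 1)
            for t in vocab
        ]
        vectors.append(vec)
    return vocab, vectors


# Example sentences are made up; the first two differ only in word order.
docs = [
    "the dog bit the man",
    "the man bit the dog",
    "stocks fell sharply today",
]
vocab, vectors = tfidf_vectors(docs)

# Word order is discarded, so the first two vectors are indistinguishable.
same_words_reordered = vectors[0] == vectors[1]
print(same_words_reordered)
```

Because the representation is built only from term counts, any clustering algorithm that consumes these vectors treats the first two sentences as the same point. A contextual model such as BERT is position-aware and would generally assign them different representations, which is the motivation the abstract gives for comparing the two.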

