An integrated clustering and BERT framework for improved topic modeling.

Author information

George Lijimol, Sumathy P

Affiliation

Department of Computer Science, Bharathidasan University, Tiruchirappalli, 620 023 Tamil Nadu India.

Publication information

Int J Inf Technol. 2023;15(4):2187-2195. doi: 10.1007/s41870-023-01268-w. Epub 2023 May 6.

DOI: 10.1007/s41870-023-01268-w

PMID: 37256029

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC10163298/

Abstract

Topic modeling is a machine learning technique widely used in Natural Language Processing (NLP) applications to infer topics from unstructured textual data. Latent Dirichlet Allocation (LDA) is one of the most widely used topic modeling techniques and can automatically detect topics in a large collection of text documents. However, LDA-based topic models alone do not always provide promising results. Clustering is an effective family of unsupervised machine learning algorithms, extensively used in applications such as information extraction from unstructured textual data and topic modeling. This study examines in detail a hybrid model of Bidirectional Encoder Representations from Transformers (BERT) and LDA for topic modeling, combined with clustering based on dimensionality reduction. Because clustering algorithms are computationally complex and their complexity increases with the number of features, dimensionality reduction methods based on principal component analysis (PCA), t-distributed stochastic neighbor embedding (t-SNE) and uniform manifold approximation and projection (UMAP) are also applied. Finally, a unified clustering-based framework using BERT and LDA is proposed for mining a set of meaningful topics from massive text corpora. Experiments simulating user input on benchmark datasets are conducted to demonstrate the effectiveness of the cluster-informed topic modeling framework using BERT and LDA. The experimental results show that clustering with dimensionality reduction helps infer more coherent topics, and hence this unified clustering and BERT-LDA based approach can be effectively used for building topic modeling applications.
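
The general shape of such a pipeline (combine per-document LDA and BERT vectors, reduce dimensionality, then cluster) might look like the following minimal sketch. It is not the authors' implementation: the abstract does not say how the two representations are combined, so the weighted concatenation and its `gamma` parameter are assumptions, the tiny hand-written vectors stand in for real LDA topic distributions and BERT embeddings, and PCA and k-means are written out in plain Python only to keep the example self-contained.

```python
import math
import random


def concat_features(lda_vecs, emb_vecs, gamma=1.0):
    """Join each document's topic vector (scaled by gamma) with its embedding.

    gamma is a hypothetical weighting knob, not a value taken from the paper.
    """
    return [[gamma * t for t in lda] + list(emb)
            for lda, emb in zip(lda_vecs, emb_vecs)]


def pca(rows, n_components, iters=200, seed=0):
    """Project rows onto leading principal components (power iteration + deflation)."""
    rng = random.Random(seed)
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [[sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    components = []
    for _ in range(n_components):
        v = [rng.uniform(-1.0, 1.0) for _ in range(d)]
        for _ in range(iters):
            w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
            norm = math.sqrt(sum(x * x for x in w)) or 1.0
            v = [x / norm for x in w]
        eigval = sum(v[a] * cov[a][b] * v[b] for a in range(d) for b in range(d))
        components.append(v)
        # Deflate so the next pass finds the following component.
        cov = [[cov[a][b] - eigval * v[a] * v[b] for b in range(d)]
               for a in range(d)]
    return [[sum(r[j] * c[j] for j in range(d)) for c in components]
            for r in centered]


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, iters=20):
    """Plain k-means with deterministic farthest-first initialisation."""
    centers = [list(points[0])]
    while len(centers) < k:
        far = max(points, key=lambda p: min(sq_dist(p, c) for c in centers))
        centers.append(list(far))
    labels = []
    for _ in range(iters):
        # Assign every point to its nearest center, then recompute the centers.
        labels = [min(range(k), key=lambda c: sq_dist(p, centers[c]))
                  for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Toy stand-ins: 2-topic LDA distributions and 3-dimensional "embeddings"
# for six documents (real BERT embeddings have hundreds of dimensions).
lda_vecs = [[0.9, 0.1], [0.8, 0.2], [0.85, 0.15],
            [0.1, 0.9], [0.2, 0.8], [0.15, 0.85]]
emb_vecs = [[1.0, 0.1, 0.0], [0.9, 0.2, 0.1], [1.1, 0.0, 0.1],
            [0.0, 1.0, 0.9], [0.1, 0.9, 1.0], [0.0, 1.1, 1.0]]

features = concat_features(lda_vecs, emb_vecs, gamma=1.0)  # 5 features per document
reduced = pca(features, n_components=2)                    # 2 features per document
labels = kmeans(reduced, k=2)
print(labels)
```

In a real setting the hand-rolled pieces would normally be replaced by library implementations, for example the PCA, t-SNE and k-means classes in scikit-learn or the UMAP class in umap-learn, and each cluster's documents would then be inspected for their most representative terms.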

Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/1eab/10163298/286c855b61bc/41870_2023_1268_Fig1_HTML.jpg

Similar articles

Evaluation of clustering and topic modeling methods over health-related tweets and emails.

Artif Intell Med. 2021 Jul;117:102096. doi: 10.1016/j.artmed.2021.102096. Epub 2021 May 7.

Discovering Thematically Coherent Biomedical Documents Using Contextualized Bidirectional Encoder Representations from Transformers-Based Clustering.

Int J Environ Res Public Health. 2022 May 12;19(10):5893. doi: 10.3390/ijerph19105893.

Vaccine sentiment analysis using BERT + NBSVM and geo-spatial approaches.

J Supercomput. 2023 May 7:1-31. doi: 10.1007/s11227-023-05319-8.

Monitoring COVID-19 pandemic through the lens of social media using natural language processing and machine learning.

Health Inf Sci Syst. 2021 Jun 25;9(1):25. doi: 10.1007/s13755-021-00158-4. eCollection 2021 Dec.

Web content topic modeling using LDA and HTML tags.

PeerJ Comput Sci. 2023 Jul 11;9:e1459. doi: 10.7717/peerj-cs.1459. eCollection 2023.

WEClustering: word embeddings based text clustering technique for large datasets.

Complex Intell Systems. 2021;7(6):3211-3224. doi: 10.1007/s40747-021-00512-9. Epub 2021 Sep 7.

The performance of BERT as data representation of text clustering.

J Big Data. 2022;9(1):15. doi: 10.1186/s40537-022-00564-9. Epub 2022 Feb 8.

An Al-BERT-Bi-GRU-LDA algorithm for negative sentiment analysis on Bilibili comments.

PeerJ Comput Sci. 2024 May 15;10:e2029. doi: 10.7717/peerj-cs.2029. eCollection 2024.

Comparison of Machine-Learning Algorithms for the Prediction of Current Procedural Terminology (CPT) Codes from Pathology Reports.

J Pathol Inform. 2022 Jan 5;13:3. doi: 10.4103/jpi.jpi_52_21. eCollection 2022.

Cited by

Fast2Vec, a modified model of FastText that enhances semantic analysis in topic evolution.

PeerJ Comput Sci. 2025 May 19;11:e2862. doi: 10.7717/peerj-cs.2862. eCollection 2025.

Evolution of AI enabled healthcare systems using textual data with a pretrained BERT deep learning model.

Sci Rep. 2025 Mar 4;15(1):7540. doi: 10.1038/s41598-025-91622-8.
