George Lijimol, Sumathy P
Department of Computer Science, Bharathidasan University, Tiruchirappalli 620 023, Tamil Nadu, India.
Int J Inf Technol. 2023;15(4):2187-2195. doi: 10.1007/s41870-023-01268-w. Epub 2023 May 6.
Topic modeling is a machine learning technique widely used in Natural Language Processing (NLP) applications to infer topics from unstructured text. Latent Dirichlet Allocation (LDA) is one of the most widely used topic modeling techniques and can automatically detect topics in large collections of text documents. However, LDA-based topic models alone do not always give satisfactory results. Clustering is an effective family of unsupervised machine learning algorithms that is widely used in applications such as information extraction from unstructured text and topic modeling. This study examines in detail a hybrid model of Bidirectional Encoder Representations from Transformers (BERT) and LDA for topic modeling, combined with clustering on dimensionality-reduced features. Because clustering algorithms are computationally expensive and their cost grows with the number of features, dimensionality reduction with PCA, t-SNE, and UMAP is also applied. Finally, a unified clustering-based framework using BERT and LDA is proposed for mining a set of meaningful topics from massive text corpora. Experiments that simulate user input on benchmark datasets demonstrate the effectiveness of this cluster-informed topic modeling framework. The experimental results show that clustering with dimensionality reduction helps infer more coherent topics, and hence the unified clustering and BERT-LDA approach can be used effectively to build topic modeling applications.
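The overall flow described in the abstract (hybrid BERT and LDA document features, dimensionality reduction, then clustering) can be illustrated with a minimal, dependency-free sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the per-document LDA topic probabilities and embedding vectors are supplied as precomputed toy lists rather than produced by real LDA or BERT models, the two are combined by simple weighted concatenation with a hypothetical weight `gamma`, PCA stands in for the three reduction methods, and k-means with farthest-point initialization stands in for the unspecified clustering algorithm.

```python
import math
import random


def combine(lda_vecs, emb_vecs, gamma=1.0):
    """Concatenate gamma-weighted topic probabilities with embedding vectors.

    gamma is a hypothetical balancing weight, not a value from the paper.
    """
    return [[gamma * p for p in lda] + list(emb)
            for lda, emb in zip(lda_vecs, emb_vecs)]


def matvec(m, v):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def pca(rows, n_components=2, iters=200, seed=0):
    """Project rows onto their top principal components (power iteration)."""
    n, d = len(rows), len(rows[0])
    means = [sum(col) / n for col in zip(*rows)]
    centered = [[x - m for x, m in zip(r, means)] for r in rows]
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    rng = random.Random(seed)
    components = []
    for _ in range(n_components):
        v = [rng.random() + 0.1 for _ in range(d)]
        for _ in range(iters):
            w = matvec(cov, v)
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:
                break
            v = [x / norm for x in w]
        eigval = sum(a * b for a, b in zip(v, matvec(cov, v)))
        components.append(v)
        # Deflate: remove the direction just found before finding the next one.
        cov = [[cov[i][j] - eigval * v[i] * v[j] for j in range(d)]
               for i in range(d)]
    return [[sum(a * b for a, b in zip(c, comp)) for comp in components]
            for c in centered]


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iters=50):
    """Lloyd's k-means with deterministic farthest-point initialization."""
    centers = [list(points[0])]
    while len(centers) < k:
        far = max(points, key=lambda p: min(sq_dist(p, c) for c in centers))
        centers.append(list(far))
    labels = []
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: sq_dist(p, centers[c]))
                  for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Toy stand-ins for real model outputs: six documents, two LDA topics,
# three embedding dimensions (real BERT embeddings have hundreds).
lda_vecs = [[0.90, 0.10], [0.80, 0.20], [0.85, 0.15],
            [0.10, 0.90], [0.20, 0.80], [0.15, 0.85]]
emb_vecs = [[1.0, 0.0, 0.1], [0.9, 0.1, 0.0], [1.0, 0.1, 0.1],
            [0.0, 1.0, 0.9], [0.1, 0.9, 1.0], [0.0, 1.0, 1.0]]

features = combine(lda_vecs, emb_vecs)
reduced = pca(features, n_components=2)
labels = kmeans(reduced, k=2)
print(labels)  # documents 0-2 and documents 3-5 land in different clusters
```

In a realistic setting the toy lists would be replaced by topic distributions from a fitted LDA model and sentence embeddings from a BERT encoder, and the top terms of each resulting cluster could then be read off as a candidate topic.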