Modeling Topics in DFA-Based Lemmatized Gujarati Text.

Affiliations

Department of Computer Engineering, Vishwakarma Government Engineering College, Chandkheda, Ahmedabad 382424, India.

Department of Computer Science and Engineering, Institute of Technology, Nirma University, Ahmedabad 382481, India.

Publication Information

Sensors (Basel). 2023 Mar 1;23(5):2708. doi: 10.3390/s23052708.

Abstract

Topic modeling is a statistics-based, unsupervised machine learning technique for mapping a high-dimensional corpus to a low-dimensional topical subspace, though the quality of the resulting topics leaves room for improvement. A topic produced by a topic model is expected to be interpretable as a concept, i.e., to correspond to a human understanding of a theme occurring in the texts. During inference, the model repeatedly draws on the corpus vocabulary, whose size affects topic quality, and morphologically rich corpora contain many inflectional forms. Practically all topic models rely on co-occurrence signals between terms, since words that frequently appear in the same sentence are likely to share a latent topic. In languages with extensive inflectional morphology, the abundance of distinct tokens dilutes these co-occurrence signals and weakens the topics. Lemmatization is often used to preempt this problem. Gujarati is a morphologically rich language in which a word may have several inflectional forms. This paper proposes a deterministic finite automaton (DFA) based lemmatization technique for Gujarati that transforms inflected word forms into their root words (lemmas). The set of topics is then inferred from this lemmatized corpus of Gujarati text. We employ statistical divergence measurements to identify semantically less coherent (overly general) topics. The results show that the lemmatized Gujarati corpus yields more interpretable and meaningful topics than unlemmatized text. Finally, the results show that lemmatization reduces the vocabulary size by 16% and improves semantic coherence on all three measures, Log Conditional Probability, Pointwise Mutual Information, and Normalized Pointwise Mutual Information, from -9.39 to -7.49, -6.79 to -5.18, and -0.23 to -0.17, respectively.
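A DFA-based suffix-stripping lemmatizer of the kind the abstract describes can be sketched as a trie built over reversed suffixes, where each trie node acts as a DFA state and accepting states record the suffix length to strip. This is a minimal illustrative sketch, not the paper's actual rule set: the suffix list below contains just two hypothetical Gujarati endings chosen for the example.

```python
# Minimal sketch of a DFA-based suffix-stripping lemmatizer.
# The suffix list is illustrative only, NOT the paper's rule set.

def build_dfa(suffixes):
    """Build a trie over reversed suffixes; each node is a DFA state."""
    root = {"accept": None, "next": {}}
    for suf in suffixes:
        node = root
        for ch in reversed(suf):
            node = node["next"].setdefault(ch, {"accept": None, "next": {}})
        node["accept"] = len(suf)  # accepting state stores suffix length
    return root

def lemmatize(word, dfa, min_stem=2):
    """Run the DFA over the word from its end and strip the longest
    accepted suffix that leaves a stem of at least min_stem characters."""
    node, best = dfa, 0
    for ch in reversed(word):
        node = node["next"].get(ch)
        if node is None:
            break
        if node["accept"] and len(word) - node["accept"] >= min_stem:
            best = node["accept"]
    return word[: len(word) - best] if best else word

# Example with two illustrative endings (a plural marker and a
# locative-style ending); real Gujarati morphology needs a fuller table.
dfa = build_dfa(["ઓ", "ોમાં"])
print(lemmatize("છોકરાઓ", dfa))  # plural noun reduced to its stem
print(lemmatize("ઘરોમાં", dfa))  # inflected form reduced to its stem
```

The `min_stem` guard prevents the automaton from stripping a suffix that would leave an implausibly short stem, a common safeguard in rule-based stemmers.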

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/8fc0/10007128/522d586cd3a2/sensors-23-02708-g001.jpg
