Voskergian Daniel, Jayousi Rashid, Yousef Malik
Computer Engineering Department, Al-Quds University, Jerusalem, Palestine.
Computer Science Department, Al-Quds University, Jerusalem, Palestine.
Sci Rep. 2024 Oct 9;14(1):23516. doi: 10.1038/s41598-024-74022-2.
TextNetTopics (Yousef et al. in Front Genet 13:893378, 2022. https://doi.org/10.3389/fgene.2022.893378) is a recently developed approach that performs text classification using topics (a topic is a group of terms or words) extracted by Latent Dirichlet Allocation topic modeling as features, rather than individual words. This enables TextNetTopics to achieve dimensionality reduction while preserving and embedding more thematic and semantic information in the text document representations. In this article, we introduce a novel approach, the Ensemble Topic Model for Topic Selection (ENTM-TS), an advancement of TextNetTopics. ENTM-TS integrates multiple topic models using the Grouping, Scoring, and Modeling approach, thereby mitigating the performance variability introduced by relying on a single topic modeling method within TextNetTopics. Additionally, we performed a thorough comparative study of TextNetTopics' performance using eleven state-of-the-art topic modeling algorithms. The topics extracted by each algorithm were used as input to the G component of the TextNetTopics tool to identify the topic model with the best predictive behavior for text classification. We conducted our evaluation on the Drug-Induced Liver Injury textual dataset from the CAMDA community and the WOS-5736 dataset. The experimental results show that Latent Semantic Indexing provides comparable performance measures with fewer discriminative features than other topic modeling methods. Moreover, our evaluation reveals that ENTM-TS surpasses or matches the best results obtained from individual topic models across the two datasets, establishing it as a robust and effective enhancement for text classification tasks.