Topic selection for text classification using ensemble topic modeling with grouping, scoring, and modeling approach.

Authors

Voskergian Daniel, Jayousi Rashid, Yousef Malik

Affiliations

Computer Engineering Department, Al-Quds University, Jerusalem, Palestine.

Computer Science Department, Al-Quds University, Jerusalem, Palestine.

Publication information

Sci Rep. 2024 Oct 9;14(1):23516. doi: 10.1038/s41598-024-74022-2.

DOI:10.1038/s41598-024-74022-2

PMID:39384798

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11464685/

Abstract

TextNetTopics (Yousef et al. in Front Genet 13:893378, 2022. https://doi.org/10.3389/fgene.2022.893378) is a recently developed approach that performs text classification using topics (a topic is a group of terms or words) extracted by Latent Dirichlet Allocation topic modeling as features, rather than individual words. This lets TextNetTopics reduce dimensionality while preserving and embedding more thematic and semantic information in the text document representations. In this article, we introduce a novel approach, the Ensemble Topic Model for Topic Selection (ENTM-TS), an advancement of TextNetTopics. ENTM-TS integrates multiple topic models using the Grouping, Scoring, and Modeling approach, thereby mitigating the performance variability introduced by relying on a single topic modeling method within TextNetTopics. Additionally, we performed a thorough comparative study to evaluate TextNetTopics' performance using eleven state-of-the-art topic modeling algorithms. We used the topics extracted by each algorithm as input to the G component of the TextNetTopics tool to select the topic model with the best predictive behavior for text classification. We conducted the evaluation on the Drug-Induced Liver Injury textual dataset from the CAMDA community and the WOS-5736 dataset. The experimental results show that Latent Semantic Indexing provides comparable performance with fewer discriminative features than the other topic modeling methods. Moreover, the performance of ENTM-TS surpasses or matches the best results obtained from individual topic models across the two datasets, establishing it as a robust and effective enhancement for text classification tasks.
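The idea of using topics rather than individual words as features can be illustrated with a minimal Python sketch. It is not the TextNetTopics or ENTM-TS implementation: the topics in `TOPICS` are hand-written and hypothetical (in practice they would come from a topic model), the function names are invented for illustration, and the scoring rule (difference of per-class mean topic counts) is a simple stand-in for whatever scoring the actual tool applies.

```python
from collections import Counter

# Hypothetical topics: each topic is a group of words. In practice these
# groups would be extracted by a topic model such as LDA.
TOPICS = {
    "liver": {"liver", "hepatic", "injury"},
    "cardio": {"heart", "cardiac", "artery"},
    "generic": {"study", "patients", "results"},
}

def topic_features(text, topics):
    """Represent a document by one count per topic instead of one per word."""
    counts = Counter(text.lower().split())
    return {name: sum(counts[w] for w in words) for name, words in topics.items()}

def score_topics(docs, labels, topics):
    """Stand-in score (an assumption): absolute difference between the
    mean topic count in class 1 and the mean topic count in class 0."""
    feats = [topic_features(d, topics) for d in docs]
    scores = {}
    for name in topics:
        pos = [f[name] for f, y in zip(feats, labels) if y == 1]
        neg = [f[name] for f, y in zip(feats, labels) if y == 0]
        scores[name] = abs(sum(pos) / len(pos) - sum(neg) / len(neg))
    return scores

def select_top_topics(scores, k):
    """Keep the k highest-scoring topics as the feature set for a classifier."""
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Toy corpus with binary labels (1 = liver-injury related, 0 = not).
docs = [
    "hepatic injury observed in patients",
    "liver injury study results",
    "cardiac artery study in patients",
    "heart results",
]
labels = [1, 1, 0, 0]

scores = score_topics(docs, labels, TOPICS)
top = select_top_topics(scores, k=2)
print(scores)
print(top)
```

In this toy corpus the "generic" topic occurs equally often in both classes, so it scores lowest and is dropped. Each document is then described by two topic counts instead of one count per vocabulary word, which is the dimensionality reduction the abstract refers to.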

Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f173/11464685/d609eeecd5cd/41598_2024_74022_Fig1_HTML.jpg

Similar articles

Topic selection for text classification using ensemble topic modeling with grouping, scoring, and modeling approach.

Sci Rep. 2024 Oct 9;14(1):23516. doi: 10.1038/s41598-024-74022-2.

TextNetTopics: Text Classification Based Word Grouping as Topics and Topics' Scoring.

Front Genet. 2022 Jun 20;13:893378. doi: 10.3389/fgene.2022.893378. eCollection 2022.

TextNetTopics Pro, a topic model-based text classification for short text by integration of semantic and document-topic distribution information.

Front Genet. 2023 Oct 5;14:1243874. doi: 10.3389/fgene.2023.1243874. eCollection 2023.

An integrated clustering and BERT framework for improved topic modeling.

Int J Inf Technol. 2023;15(4):2187-2195. doi: 10.1007/s41870-023-01268-w. Epub 2023 May 6.

Evaluation of clustering and topic modeling methods over health-related tweets and emails.

Artif Intell Med. 2021 Jul;117:102096. doi: 10.1016/j.artmed.2021.102096. Epub 2021 May 7.

Topic2features: a novel framework to classify noisy and sparse textual data using LDA topic distributions.

PeerJ Comput Sci. 2021 Aug 11;7:e677. doi: 10.7717/peerj-cs.677. eCollection 2021.

Using Topic Modeling Methods for Short-Text Data: A Comparative Analysis.

Front Artif Intell. 2020 Jul 14;3:42. doi: 10.3389/frai.2020.00042. eCollection 2020.

Improving the utility of MeSH® terms using the TopicalMeSH representation.

J Biomed Inform. 2016 Jun;61:77-86. doi: 10.1016/j.jbi.2016.03.013. Epub 2016 Mar 19.

Investigating the Efficient Use of Word Embedding with Neural-Topic Models for Interpretable Topics from Short Texts.

Sensors (Basel). 2022 Jan 23;22(3):852. doi: 10.3390/s22030852.

Semantic relational machine learning model for sentiment analysis using cascade feature selection and heterogeneous classifier ensemble.

PeerJ Comput Sci. 2022 Sep 20;8:e1100. doi: 10.7717/peerj-cs.1100. eCollection 2022.

Cited by

Impact of exercise on outcomes among Chinese patients with Crohn's disease: a mixed methods study based on social media and the real world.

BMC Gastroenterol. 2024 Nov 29;24(1):441. doi: 10.1186/s12876-024-03533-z.

References

miRGediNET: A comprehensive examination of common genes in miRNA-Target interactions and disease associations: Insights from a grouping-scoring-modeling approach.

Heliyon. 2023 Nov 22;9(12):e22666. doi: 10.1016/j.heliyon.2023.e22666. eCollection 2023 Dec.

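TextNetTopics Pro, a topic model-based text classification for short text by integration of semantic and document-topic distribution information.

Front Genet. 2023 Oct 5;14:1243874. doi: 10.3389/fgene.2023.1243874. eCollection 2023.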

GeNetOntology: identifying affected gene ontology terms via grouping, scoring, and modeling of gene expression data utilizing biological knowledge-based machine learning.

Front Genet. 2023 Aug 21;14:1139082. doi: 10.3389/fgene.2023.1139082. eCollection 2023.

Review of feature selection approaches based on grouping of features.

PeerJ. 2023 Jul 17;11:e15666. doi: 10.7717/peerj.15666. eCollection 2023.

Invention of 3Mint for feature grouping and scoring in multi-omics.

Front Genet. 2023 Mar 15;14:1093326. doi: 10.3389/fgene.2023.1093326. eCollection 2023.

PriPath: identifying dysregulated pathways from differential gene expression via grouping, scoring, and modeling with an embedded feature selection approach.

BMC Bioinformatics. 2023 Feb 23;24(1):60. doi: 10.1186/s12859-023-05187-2.

miRdisNET: Discovering microRNA biomarkers that are associated with diseases utilizing biological knowledge-based machine learning.

Front Genet. 2023 Jan 12;13:1076554. doi: 10.3389/fgene.2022.1076554. eCollection 2022.

GediNET for discovering gene associations across diseases using knowledge based machine learning approach.

Sci Rep. 2022 Nov 19;12(1):19955. doi: 10.1038/s41598-022-24421-0.

TextNetTopics: Text Classification Based Word Grouping as Topics and Topics' Scoring.

Front Genet. 2022 Jun 20;13:893378. doi: 10.3389/fgene.2022.893378. eCollection 2022.

Topic Modeling for Interpretable Text Classification From EHRs.

Front Big Data. 2022 May 4;5:846930. doi: 10.3389/fdata.2022.846930. eCollection 2022.
