TextNetTopics Pro, a topic model-based text classification for short text by integration of semantic and document-topic distribution information.

Author information

Voskergian Daniel, Bakir-Gungor Burcu, Yousef Malik

Affiliations

Computer Engineering Department, Faculty of Engineering, Al-Quds University, Jerusalem, Palestine.

Department of Computer Engineering, Faculty of Engineering, Abdullah Gul University, Kayseri, Türkiye.

Publication information

Front Genet. 2023 Oct 5;14:1243874. doi: 10.3389/fgene.2023.1243874. eCollection 2023.

DOI:10.3389/fgene.2023.1243874

PMID:37867598

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC10585361/

Abstract

With the exponential growth in the daily publication of scientific articles, automatic classification and categorization can assist in assigning articles to a predefined category. Article titles are concise descriptions of the articles' content and carry valuable information for document classification and categorization. However, the shortness, data sparseness, limited word occurrences, and inadequate contextual information of scientific document titles hinder the direct application of conventional text mining and machine learning algorithms to these short texts, making their classification a challenging task. This study first evaluates the performance of our earlier method, TextNetTopics, on short text. Second, we propose an advanced version called TextNetTopics Pro, a novel short-text classification framework that combines lexical features organized in topics of words with the topic distribution extracted by a topic model, in order to alleviate the data-sparseness problem when classifying short texts. We evaluate the proposed approach using nine state-of-the-art short-text topic models on two publicly available datasets of scientific article titles as short-text documents. The first dataset is related to the biomedical field, and the other is related to computer science publications. Additionally, we compare the predictive performance of the models generated with and without the abstracts. Finally, we demonstrate the robustness and effectiveness of the proposed approach in handling imbalanced data, particularly in the classification of Drug-Induced Liver Injury articles as part of the CAMDA challenge. Taking advantage of the semantic information detected by topic models proved to be a reliable way to improve the overall performance of ML classifiers.
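
The feature combination described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `build_feature_vector`, the toy topics, and the document-topic values are all hypothetical, and the topic model that would produce them is assumed to have been run beforehand. The sketch only shows how word counts restricted to the words of selected topics might be concatenated with a document-topic distribution to give a denser representation of a short title.

```python
from collections import Counter
from typing import Dict, List, Sequence


def build_feature_vector(
    tokens: Sequence[str],
    selected_topics: Dict[str, List[str]],
    doc_topic_dist: Sequence[float],
) -> List[float]:
    """Concatenate lexical features with a document-topic distribution.

    Hypothetical helper: the topics and the distribution are assumed to
    come from a topic model that was fitted elsewhere.
    """
    # Vocabulary = union of the words in the selected topics, in stable order.
    vocabulary: List[str] = []
    seen = set()
    for words in selected_topics.values():
        for word in words:
            if word not in seen:
                seen.add(word)
                vocabulary.append(word)

    # Lexical part: how often each topic word occurs in the title.
    counts = Counter(tokens)
    lexical = [float(counts[word]) for word in vocabulary]

    # Topic part: the document-topic distribution, appended unchanged.
    return lexical + [float(p) for p in doc_topic_dist]


# Toy, made-up inputs for illustration only.
topics = {
    "topic_0": ["liver", "injury", "drug"],
    "topic_1": ["network", "learning", "drug"],
}
title_tokens = "drug induced liver injury in adults".split()
theta = [0.9, 0.1]  # assumed output of a topic model for this title

features = build_feature_vector(title_tokens, topics, theta)
print(features)
```

The resulting vector could then be passed to any standard classifier. Even when a title shares few words with the selected topics, the appended topic distribution still carries some signal, which is the intuition behind using it against data sparseness.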

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/741f/10585361/c7d4a552956d/fgene-14-1243874-g001.jpg

Similar articles

Topic selection for text classification using ensemble topic modeling with grouping, scoring, and modeling approach.

Sci Rep. 2024 Oct 9;14(1):23516. doi: 10.1038/s41598-024-74022-2.

TextNetTopics: Text Classification Based Word Grouping as Topics and Topics' Scoring.

Front Genet. 2022 Jun 20;13:893378. doi: 10.3389/fgene.2022.893378. eCollection 2022.

Large scale biomedical texts classification: a kNN and an ESA-based approaches.

J Biomed Semantics. 2016 Jun 16;7:40. doi: 10.1186/s13326-016-0073-1.

Folic acid supplementation and malaria susceptibility and severity among people taking antifolate antimalarial drugs in endemic areas.

Cochrane Database Syst Rev. 2022 Feb 1;2(2022):CD014217. doi: 10.1002/14651858.CD014217.

Investigating the Efficient Use of Word Embedding with Neural-Topic Models for Interpretable Topics from Short Texts.

Sensors (Basel). 2022 Jan 23;22(3):852. doi: 10.3390/s22030852.

Multi-class classification of COVID-19 documents using machine learning algorithms.

J Intell Inf Syst. 2023;60(2):571-591. doi: 10.1007/s10844-022-00768-8. Epub 2022 Nov 29.

Short text topic modelling using local and global word-context semantic correlation.

Multimed Tools Appl. 2023 Feb 2:1-23. doi: 10.1007/s11042-023-14352-x.

A novel multiple kernel fuzzy topic modeling technique for biomedical data.

BMC Bioinformatics. 2022 Jul 12;23(1):275. doi: 10.1186/s12859-022-04780-1.

Investigating Multi-Level Semantic Extraction with Squash Capsules for Short Text Classification.

Entropy (Basel). 2022 Apr 23;24(5):590. doi: 10.3390/e24050590.

Cited by

RCE-IFE: recursive cluster elimination with intra-cluster feature elimination.

PeerJ Comput Sci. 2025 Feb 7;11:e2528. doi: 10.7717/peerj-cs.2528. eCollection 2025.

Topic selection for text classification using ensemble topic modeling with grouping, scoring, and modeling approach.

Sci Rep. 2024 Oct 9;14(1):23516. doi: 10.1038/s41598-024-74022-2.

References

GeNetOntology: identifying affected gene ontology terms via grouping, scoring, and modeling of gene expression data utilizing biological knowledge-based machine learning.

Front Genet. 2023 Aug 21;14:1139082. doi: 10.3389/fgene.2023.1139082. eCollection 2023.

Invention of 3Mint for feature grouping and scoring in multi-omics.

Front Genet. 2023 Mar 15;14:1093326. doi: 10.3389/fgene.2023.1093326. eCollection 2023.

PriPath: identifying dysregulated pathways from differential gene expression via grouping, scoring, and modeling with an embedded feature selection approach.

BMC Bioinformatics. 2023 Feb 23;24(1):60. doi: 10.1186/s12859-023-05187-2.

miRdisNET: Discovering microRNA biomarkers that are associated with diseases utilizing biological knowledge-based machine learning.

Front Genet. 2023 Jan 12;13:1076554. doi: 10.3389/fgene.2022.1076554. eCollection 2022.

GediNET for discovering gene associations across diseases using knowledge based machine learning approach.

Sci Rep. 2022 Nov 19;12(1):19955. doi: 10.1038/s41598-022-24421-0.

TextNetTopics: Text Classification Based Word Grouping as Topics and Topics' Scoring.

Front Genet. 2022 Jun 20;13:893378. doi: 10.3389/fgene.2022.893378. eCollection 2022.

miRModuleNet: Detecting miRNA-mRNA Regulatory Modules.

Front Genet. 2022 Apr 12;13:767455. doi: 10.3389/fgene.2022.767455. eCollection 2022.

miRcorrNet: machine learning-based integration of miRNA and mRNA expression profiles, combined with feature grouping and ranking.

PeerJ. 2021 May 19;9:e11458. doi: 10.7717/peerj.11458. eCollection 2021.

CogNet: classification of gene expression data based on ranked active-subnetwork-oriented KEGG pathway enrichment analysis.

PeerJ Comput Sci. 2021 Feb 22;7:e336. doi: 10.7717/peerj-cs.336. eCollection 2021.

maTE: discovering expressed interactions between microRNAs and their targets.

Bioinformatics. 2019 Oct 15;35(20):4020-4028. doi: 10.1093/bioinformatics/btz204.
