TextNetTopics: Text Classification Based Word Grouping as Topics and Topics' Scoring.

Author information

Yousef Malik, Voskergian Daniel

Affiliations

Zefat Academic College, Zefat, Israel.

Computer Engineering Department, Al-Quds University, Jerusalem, Palestine.

Publication information

Front Genet. 2022 Jun 20;13:893378. doi: 10.3389/fgene.2022.893378. eCollection 2022.

DOI:10.3389/fgene.2022.893378

PMID:35795215

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC9251539/

Abstract

Medical document classification is an active research problem and among the most challenging within the text classification domain. Medical datasets often contain massive feature sets in which many features are irrelevant or redundant and add noise, thereby reducing classification performance. Therefore, to improve the accuracy of a classification model, it is crucial to choose a set of features (terms) that best discriminates between the classes of medical documents. This study proposes TextNetTopics, a novel approach that applies feature selection over a Bag-of-topics (BOT) representation rather than the traditional Bag-of-words (BOW). Our approach thus performs topic selection rather than word selection. TextNetTopics is based on the generic approach G-S-M (Grouping, Scoring, and Modeling), developed by Yousef and his colleagues and used mainly on biological data. The proposed approach scores topics in order to select the top topics for training the classifier. This study applied TextNetTopics to textual data in response to the CAMDA challenge. TextNetTopics outperforms various feature selection approaches and performs well when the model is applied to the validation data provided by CAMDA. Additionally, we have applied our algorithm to different textual datasets.
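
The Grouping, Scoring, and Modeling steps described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not the authors' implementation: the toy corpus, the hand-written word groups standing in for topics (`TOPICS`), the nearest-centroid classifier used as the scorer, and the helper names (`loo_accuracy`, `score_topics`, `select_top_words`). The abstract does not state how topics are derived or which classifier scores them.

```python
from collections import Counter

# Toy labelled corpus (hypothetical data, for illustration only).
DOCS = [
    ("liver enzyme elevated after drug dose", 1),
    ("drug induced liver injury reported", 1),
    ("hepatic toxicity and enzyme changes", 1),
    ("patient sleep quality and diet survey", 0),
    ("diet exercise and sleep habits", 0),
    ("survey of exercise in adults", 0),
]

# Grouping (G): hand-made word groups standing in for topics (assumption).
TOPICS = {
    "liver": {"liver", "hepatic", "enzyme", "injury", "toxicity"},
    "lifestyle": {"sleep", "diet", "exercise", "survey"},
    "filler": {"and", "of", "in", "after"},
}


def vectorize(text, vocab):
    """Count only the words in `vocab`, in the order given by `vocab`."""
    counts = Counter(w for w in text.split() if w in vocab)
    return [counts[w] for w in vocab]


def centroid(rows):
    """Column-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def loo_accuracy(docs, words):
    """Leave-one-out accuracy of a nearest-centroid classifier on `words`."""
    vocab = sorted(words)
    vecs = [(vectorize(text, vocab), label) for text, label in docs]
    correct = 0
    for i, (x, y) in enumerate(vecs):
        train = vecs[:i] + vecs[i + 1:]
        cents = {
            label: centroid([v for v, lab in train if lab == label])
            for label in {lab for _, lab in train}
        }
        # sorted() makes tie-breaking between classes deterministic.
        pred = min(sorted(cents), key=lambda lab: sq_dist(x, cents[lab]))
        correct += pred == y
    return correct / len(vecs)


def score_topics(docs, topics):
    """Scoring (S): rate each topic by how well its words alone classify."""
    return {name: loo_accuracy(docs, words) for name, words in topics.items()}


def select_top_words(scores, topics, k):
    """Keep the k best-scoring topics and merge their words."""
    ranked = sorted(scores, key=lambda name: (-scores[name], name))
    top = ranked[:k]
    return top, set().union(*(topics[name] for name in top))


scores = score_topics(DOCS, TOPICS)
top_names, selected = select_top_words(scores, TOPICS, k=2)

# Modeling (M): evaluate a classifier restricted to the selected topics' words.
final_accuracy = loo_accuracy(DOCS, selected)
print(top_names, round(final_accuracy, 3))
```

The point of the sketch is that selection operates on whole word groups: a topic is kept or dropped as a unit, and only the words of the surviving topics reach the final model.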

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/86da/9251539/ea65384bb2af/fgene-13-893378-g001.jpg
