

TextNetTopics: Text Classification Based Word Grouping as Topics and Topics' Scoring.

Authors

Yousef Malik, Voskergian Daniel

Affiliations

Zefat Academic College, Zefat, Israel.

Computer Engineering Department, Al-Quds University, Jerusalem, Palestine.

Publication information

Front Genet. 2022 Jun 20;13:893378. doi: 10.3389/fgene.2022.893378. eCollection 2022.

Abstract

Medical document classification is an active research problem and one of the most challenging within the text classification domain. Medical datasets often contain massive feature sets in which many features are irrelevant or redundant and add noise, thereby reducing classification performance. Therefore, to improve the accuracy of a classification model, it is crucial to choose a set of features (terms) that best discriminates between the classes of medical documents. This study proposes TextNetTopics, a novel approach that performs feature selection over a Bag-of-topics (BOT) rather than the traditional Bag-of-words (BOW). Thus, our approach performs topic selection rather than word selection. TextNetTopics is based on the generic approach G-S-M (Grouping, Scoring, and Modeling), developed by Yousef and his colleagues and used mainly on biological data. The proposed approach scores topics in order to select the top-ranked topics for training the classifier. This study applied TextNetTopics to textual data in response to the CAMDA challenge. TextNetTopics outperforms various feature selection approaches and performs well when the model is applied to the validation data provided by CAMDA. Additionally, we have applied our algorithm to other textual datasets.
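
The Grouping, Scoring, and Modeling idea described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation. It assumes the topics (groups of words) are already given, here as a small hand-written dictionary, and it scores each topic by the held-out accuracy of a simple nearest-centroid classifier restricted to that topic's words, a stand-in for whatever classifier and scoring scheme the study actually used. All function names and the toy data are hypothetical.

```python
from collections import Counter

def vectorize(tokens, vocab):
    """Count how often each vocabulary word occurs in one tokenized document."""
    counts = Counter(tokens)
    return [counts[w] for w in vocab]

def centroid_accuracy(train, test, vocab):
    """Held-out accuracy of a nearest-centroid classifier that sees only `vocab`.

    Assumption: this simple classifier stands in for the study's real model.
    """
    sums, sizes = {}, Counter()
    for tokens, label in train:
        acc = sums.setdefault(label, [0.0] * len(vocab))
        for i, v in enumerate(vectorize(tokens, vocab)):
            acc[i] += v
        sizes[label] += 1
    centroids = {lab: [v / sizes[lab] for v in s] for lab, s in sums.items()}

    correct = 0
    for tokens, label in test:
        vec = vectorize(tokens, vocab)
        # Predict the class whose centroid is closest in squared Euclidean distance.
        pred = min(
            centroids,
            key=lambda lab: sum((a - b) ** 2 for a, b in zip(vec, centroids[lab])),
        )
        correct += pred == label
    return correct / len(test)

def rank_topics(topics, train, test):
    """Scoring step: score every topic on its own, best first."""
    scores = {
        name: centroid_accuracy(train, test, sorted(words))
        for name, words in topics.items()
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

def select_vocabulary(ranked, topics, top_k):
    """Modeling step input: the union of words from the top_k topics."""
    vocab = set()
    for name, _ in ranked[:top_k]:
        vocab |= topics[name]
    return sorted(vocab)

# Grouping step (assumed already done): hypothetical topics as word sets.
topics = {
    "liver_injury": {"liver", "hepatic", "injury", "toxicity"},
    "lab_methods": {"assay", "binding", "protein"},
    "generic": {"study", "results"},
}

# Toy tokenized documents with class labels (illustrative only).
train = [
    (["liver", "toxicity", "injury"], "relevant"),
    (["hepatic", "injury", "drug"], "relevant"),
    (["protein", "binding", "assay"], "irrelevant"),
    (["gene", "expression", "assay"], "irrelevant"),
]
test = [
    (["drug", "liver", "injury"], "relevant"),
    (["protein", "expression", "gene"], "irrelevant"),
]

ranked = rank_topics(topics, train, test)
selected = select_vocabulary(ranked, topics, top_k=1)
print(ranked[0][0], selected)
```

In this sketch the final classifier would be trained on documents reduced to `selected`, so feature selection operates on whole topics rather than on individual words.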


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/86da/9251539/ea65384bb2af/fgene-13-893378-g001.jpg
