Yousef Malik, Voskergian Daniel
Zefat Academic College, Zefat, Israel.
Computer Engineering Department, Al-Quds University, Jerusalem, Palestine.
Front Genet. 2022 Jun 20;13:893378. doi: 10.3389/fgene.2022.893378. eCollection 2022.
Medical document classification is an active research problem and one of the most challenging within the text classification domain. Medical datasets often contain massive feature sets in which many features are irrelevant, redundant, or noisy, which reduces classification performance. Therefore, to improve the accuracy of a classification model, it is crucial to choose a set of features (terms) that best discriminates between the classes of medical documents. This study proposes TextNetTopics, a novel approach that performs feature selection over a Bag-of-topics (BOT) representation rather than the traditional Bag-of-words (BOW) representation. Thus, our approach performs topic selection rather than word selection. TextNetTopics is based on the generic approach entitled G-S-M (Grouping, Scoring, and Modeling), developed by Yousef and his colleagues and applied mainly to biological data. The proposed approach scores topics and selects the top-ranked topics for training the classifier. This study applied TextNetTopics to textual data in response to the CAMDA challenge. TextNetTopics outperforms various feature selection approaches and performs well when the model is applied to the validation data provided by CAMDA. Additionally, we have applied our algorithm to different textual datasets.
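The Grouping, Scoring, and Modeling idea described in the abstract can be illustrated with a minimal sketch. The abstract does not say which topic model, scoring classifier, or ranking rule TextNetTopics uses, so everything below is an assumption made for illustration: the topics are hand-written word sets, the score is the leave-one-out accuracy of a nearest-centroid classifier, and the names `score_topic` and `select_top_topics` are hypothetical.

```python
from collections import Counter


def bow_vector(doc_tokens, vocab):
    """Count how often each vocabulary word occurs in one tokenized document."""
    counts = Counter(doc_tokens)
    return [counts[w] for w in vocab]


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def score_topic(topic_words, docs, labels):
    """Scoring step (assumed): leave-one-out accuracy of a nearest-centroid
    classifier that sees only the words belonging to one topic.
    Assumes every class has at least two documents."""
    vocab = sorted(topic_words)
    vectors = [bow_vector(d, vocab) for d in docs]
    correct = 0
    for i in range(len(docs)):
        by_class = {}
        for j, (v, y) in enumerate(zip(vectors, labels)):
            if j != i:  # hold out document i
                by_class.setdefault(y, []).append(v)
        centroids = {y: centroid(vs) for y, vs in by_class.items()}
        pred = min(sorted(centroids), key=lambda y: sq_dist(vectors[i], centroids[y]))
        correct += pred == labels[i]
    return correct / len(docs)


def select_top_topics(topics, docs, labels, k):
    """Rank topics by score and return the k best topic names together with
    the union of their words, which becomes the feature set for modeling."""
    ranked = sorted(topics.items(),
                    key=lambda kv: score_topic(kv[1], docs, labels),
                    reverse=True)
    top = ranked[:k]
    names = [name for name, _ in top]
    features = sorted(set().union(*(words for _, words in top)))
    return names, features


# Toy data, invented for this sketch (not from the CAMDA dataset).
docs = [
    ["tumor", "biopsy", "chemotherapy"],
    ["tumor", "chemotherapy", "patient"],
    ["insulin", "glucose", "patient"],
    ["glucose", "insulin", "diet"],
]
labels = ["oncology", "oncology", "diabetes", "diabetes"]

# Grouping step (assumed given): each topic is a set of words.
topics = {
    "T_cancer": {"tumor", "chemotherapy", "biopsy"},
    "T_metabolic": {"insulin", "glucose"},
    "T_generic": {"patient", "study"},
}

names, features = select_top_topics(topics, docs, labels, k=2)
print(names)
print(features)
```

In this sketch the final modeling step would train a classifier on `features` only. In practice the topics would likely come from a topic model fitted on the corpus rather than from hand-written sets.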