Department of Computer Engineering, Abdullah Gul University, Barbaros, Erkilet Blvd. Sumer Campus, Kayseri, 38080, Turkey.
Comput Biol Med. 2024 Aug;178:108721. doi: 10.1016/j.compbiomed.2024.108721. Epub 2024 Jun 19.
Since the 2000s, digitalization has been a crucial transformation in our lives. It has also produced large volumes of unstructured textual data that must be processed, including articles, clinical records, web pages, and social media posts. Text classification, a core analysis task, assigns each textual entity to its correct category. Categorizing documents from different domains is relatively straightforward, since the instances are unlikely to contain similar contexts. Classifying documents within a single domain is harder because the documents share the same context. We therefore aim to classify medical articles about four common cancer types (Leukemia, Non-Hodgkin Lymphoma, Bladder Cancer, and Thyroid Cancer) by constructing machine learning and deep learning models. We used 383,914 medical articles about these four cancer types, collected through the PubMed API. To build the classification models, we split the dataset into training (70%), testing (20%), and validation (10%) sets. We built widely used machine learning models (Logistic Regression, XGBoost, CatBoost, and Random Forest classifiers) and modern deep learning models (convolutional neural network, CNN; long short-term memory, LSTM; and gated recurrent unit, GRU). To evaluate the models, we computed their average classification performance (precision, recall, and F1 score) over ten distinct dataset splits. The best-performing deep learning model(s) reached an F1 score of 98%. The traditional machine learning models also achieved reasonably high F1 scores, with 95% in the worst-performing case. Overall, we constructed multiple models that classify articles from a hard-to-classify, single-domain dataset in the medical domain.
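The evaluation protocol described above (a 70/20/10 split, repeated over ten distinct splits, with precision, recall, and F1 averaged) can be illustrated with a minimal standard-library Python sketch. It is not the authors' code. The function names, the use of seeds 0 to 9, and the choice of macro averaging across classes are assumptions made for illustration, since the abstract does not state how the per-class scores were combined. `train_and_predict` is a hypothetical placeholder for any of the models.

```python
import random
from statistics import mean


def split_70_20_10(n_items, seed):
    """Shuffle indices 0..n_items-1 and cut them into train/test/validation parts."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    n_train = n_items * 7 // 10  # 70% for training
    n_test = n_items * 2 // 10   # 20% for testing
    train = indices[:n_train]
    test = indices[n_train:n_train + n_test]
    validation = indices[n_train + n_test:]  # remaining ~10%
    return train, test, validation


def macro_prf(y_true, y_pred, labels=None):
    """Per-class precision, recall and F1, averaged with equal weight per class.

    Macro averaging is an assumption; the abstract does not name the scheme.
    """
    if labels is None:
        labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    precisions, recalls, f1s = [], [], []
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    return mean(precisions), mean(recalls), mean(f1s)


def average_over_splits(item_labels, train_and_predict, n_splits=10):
    """Average macro precision/recall/F1 on the test part over n_splits seeds.

    train_and_predict(train_idx, test_idx) is a hypothetical callable that
    fits a model on the training indices and returns one predicted label
    per test index.
    """
    scores = []
    for seed in range(n_splits):
        train, test, _validation = split_70_20_10(len(item_labels), seed)
        y_true = [item_labels[i] for i in test]
        y_pred = train_and_predict(train, test)
        scores.append(macro_prf(y_true, y_pred))
    # Transpose the list of (P, R, F1) triples and average each column.
    return tuple(mean(column) for column in zip(*scores))
```

In this sketch the validation part is set aside but not used. In practice it would typically serve for model selection or early stopping, which the abstract does not detail.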