Multi-class classification of COVID-19 documents using machine learning algorithms.

Authors

Rabby Gollam, Berka Petr

Affiliation

Department of Information and Knowledge Engineering, Prague University of Economics and Business, Prague, Czech Republic.

Publication

J Intell Inf Syst. 2023;60(2):571-591. doi: 10.1007/s10844-022-00768-8. Epub 2022 Nov 29.

DOI:10.1007/s10844-022-00768-8
PMID:36465147
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC9707112/
Abstract

Document classification is a crucial task for most biomedical research paper corpora. Amid the global epidemic, researchers across many fields need to identify relevant scientific papers accurately and quickly from a flood of biomedical publications. Classification can also help learners and researchers assign a paper to an appropriate category and find relevant papers in a very short time. A biomedical document classifier needs to go beyond a "general" text classifier because it does not depend only on the text itself (i.e., titles and abstracts) but can also use other information, such as entities extracted using medical taxonomies or bibliometric data. The main objective of this research was to determine which types of information or features, and which representation methods, influence the biomedical document classification task. To this end, we ran several experiments with conventional text classification methods on different kinds of features extracted from titles, abstracts, and bibliometric data. The procedure comprises data cleaning, feature engineering, and multi-class classification. Eleven variants of input data tables were created and analyzed using ten machine learning algorithms. We also evaluated the data efficiency and interpretability of these models, which are essential features of any biomedical research paper classification system, specifically for handling the COVID-19 health crisis. Our major findings are that TF-IDF representations outperform the entity extraction methods and that the abstract alone provides sufficient information for correct classification. Among the machine learning algorithms used, Random Forest and a neural network (BERT) achieved the best performance across the various forms of document representation. Our results lead to a concrete guideline for practitioners on biomedical document classification.
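To make the TF-IDF finding concrete, the following is a minimal sketch of classifying abstracts from TF-IDF vectors, using only the Python standard library. It is not the authors' pipeline: the toy abstracts, the three category labels, and all function names are hypothetical, and a simple nearest-centroid rule with cosine similarity stands in for the Random Forest and BERT models named in the abstract.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and keep alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def fit_idf(docs_tokens):
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    n = len(docs_tokens)
    df = Counter()
    for tokens in docs_tokens:
        df.update(set(tokens))
    return {term: math.log((1 + n) / (1 + count)) + 1.0 for term, count in df.items()}


def tfidf_vector(tokens, idf):
    """L2-normalised TF-IDF vector as a sparse dict; unseen terms are dropped."""
    tf = Counter(t for t in tokens if t in idf)
    vec = {t: count * idf[t] for t, count in tf.items()}
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {t: v / norm for t, v in vec.items()} if norm else {}


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(v * b.get(t, 0.0) for t, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def train_centroids(texts, labels):
    """Sum the TF-IDF vectors of each class (cosine ignores the scale)."""
    tokens = [tokenize(t) for t in texts]
    idf = fit_idf(tokens)
    centroids = defaultdict(lambda: defaultdict(float))
    for toks, label in zip(tokens, labels):
        for term, weight in tfidf_vector(toks, idf).items():
            centroids[label][term] += weight
    return idf, {label: dict(vec) for label, vec in centroids.items()}


def predict(text, idf, centroids):
    """Return the label whose centroid is most similar to the text."""
    vec = tfidf_vector(tokenize(text), idf)
    return max(sorted(centroids), key=lambda label: cosine(vec, centroids[label]))


# Hypothetical toy abstracts and labels, for illustration only.
train_texts = [
    "remdesivir therapy reduced recovery time in hospitalized patients",
    "antiviral drug treatment trial in patients with severe disease",
    "airborne transmission of the virus in indoor household settings",
    "household contacts and aerosol spread drive transmission",
    "rt pcr assay sensitivity for diagnosis from nasal swabs",
    "rapid antigen test accuracy compared with pcr diagnosis",
]
train_labels = [
    "Treatment", "Treatment",
    "Transmission", "Transmission",
    "Diagnosis", "Diagnosis",
]

idf, centroids = train_centroids(train_texts, train_labels)
print(predict("a randomized trial of antiviral drug therapy", idf, centroids))
```

In practice the same representation would normally come from a library vectorizer and be fed to a stronger classifier; the sketch only shows how term weights derived from abstracts alone can separate categories.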

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/bab9/9707112/13e468af0df7/10844_2022_768_Fig1_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/bab9/9707112/04055336a7f0/10844_2022_768_Fig2_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/bab9/9707112/c5c6c79ddd50/10844_2022_768_Fig3_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/bab9/9707112/ba660c34b8bc/10844_2022_768_Fig4_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/bab9/9707112/ef039953fe77/10844_2022_768_Fig5_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/bab9/9707112/03d865eba597/10844_2022_768_Fig6_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/bab9/9707112/fa3746862efa/10844_2022_768_Fig7_HTML.jpg

Similar articles

1
Multi-class classification of COVID-19 documents using machine learning algorithms.
J Intell Inf Syst. 2023;60(2):571-591. doi: 10.1007/s10844-022-00768-8. Epub 2022 Nov 29.
2
Impact of COVID-19 research: a study on predicting influential scholarly documents using machine learning and a domain-independent knowledge graph.
J Biomed Semantics. 2023 Nov 28;14(1):18. doi: 10.1186/s13326-023-00298-4.
3
Folic acid supplementation and malaria susceptibility and severity among people taking antifolate antimalarial drugs in endemic areas.
Cochrane Database Syst Rev. 2022 Feb 1;2(2022):CD014217. doi: 10.1002/14651858.CD014217.
4
Why was this cited? Explainable machine learning applied to COVID-19 research literature.
Scientometrics. 2022;127(5):2313-2349. doi: 10.1007/s11192-022-04314-9. Epub 2022 Apr 9.
5
BertSRC: transformer-based semantic relation classification.
BMC Med Inform Decis Mak. 2022 Sep 6;22(1):234. doi: 10.1186/s12911-022-01977-5.
6
TextNetTopics Pro, a topic model-based text classification for short text by integration of semantic and document-topic distribution information.
Front Genet. 2023 Oct 5;14:1243874. doi: 10.3389/fgene.2023.1243874. eCollection 2023.
7
Pediatric Injury Surveillance From Uncoded Emergency Department Admission Records in Italy: Machine Learning-Based Text-Mining Approach.
JMIR Public Health Surveill. 2023 Jul 12;9:e44467. doi: 10.2196/44467.
8
Improving the utility of MeSH® terms using the TopicalMeSH representation.
J Biomed Inform. 2016 Jun;61:77-86. doi: 10.1016/j.jbi.2016.03.013. Epub 2016 Mar 19.
9
A clinical text classification paradigm using weak supervision and deep representation.
BMC Med Inform Decis Mak. 2019 Jan 7;19(1):1. doi: 10.1186/s12911-018-0723-6.
10
Integrating image caption information into biomedical document classification in support of biocuration.
Database (Oxford). 2020 Jan 1;2020. doi: 10.1093/database/baaa024.

Cited by

1
Genetic Algorithms for Feature Selection in the Classification of COVID-19 Patients.
Bioengineering (Basel). 2024 Sep 23;11(9):952. doi: 10.3390/bioengineering11090952.
2
Towards Improved XAI-Based Epidemiological Research into the Next Potential Pandemic.
Life (Basel). 2024 Jun 21;14(7):783. doi: 10.3390/life14070783.
