Biomedical literature classification using encyclopedic knowledge: a Wikipedia-based bag-of-concepts approach.

Author information

Mouriño García Marcos Antonio, Pérez Rodríguez Roberto, Anido Rifón Luis E

Affiliation

Department of Telematics Engineering, University of Vigo, Vigo, Spain.

Publication information

PeerJ. 2015 Sep 29;3:e1279. doi: 10.7717/peerj.1279. eCollection 2015.

DOI:10.7717/peerj.1279

PMID:26468436

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC4592155/

Abstract

Automatic classification of text documents into a set of categories has many applications, and the classification of biomedical literature stands out among them. Biomedical staff and researchers deal with a large volume of literature in their daily activities, so a system that gives simple and effective access to documents of interest would be useful; this requires that the documents be sorted according to some criteria, that is, classified. Documents to be classified are usually represented following the bag-of-words (BoW) paradigm: features are the words in the text, which therefore suffer from synonymy and polysemy, and their weights are based only on frequency of occurrence. This paper presents an empirical study of the efficiency of a classifier that leverages encyclopedic background knowledge, specifically Wikipedia, to create bag-of-concepts (BoC) representations of documents, where a concept is understood as a "unit of meaning", thereby tackling synonymy and polysemy. In addition, concepts are weighted by their semantic relevance in the text. The proposal was evaluated empirically on OHSUMED, one of the corpora commonly used for evaluating classification and retrieval of biomedical information, and on UVigoMED, a purpose-built corpus of MEDLINE biomedical abstracts. The results show that the Wikipedia-based bag-of-concepts representation outperforms the classical bag-of-words representation by up to 157% in the single-label classification problem and up to 100% in the multi-label problem on the OHSUMED corpus, and by up to 122% in the single-label problem and up to 155% in the multi-label problem on the UVigoMED corpus.
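The contrast between the two representations can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the tiny `CONCEPTS` lexicon, the concept identifier `Myocardial_infarction`, and both function names are hypothetical stand-ins for a real Wikipedia-based annotator, and raw counts are used in place of the semantic-relevance weighting mentioned in the abstract. Polysemy would additionally need context-based disambiguation, which is not shown.

```python
import re
from collections import Counter

# Hypothetical toy lexicon mapping surface forms to a concept identifier.
# A real system would derive such mappings from Wikipedia; this one is
# hand-written for illustration only.
CONCEPTS = {
    "heart attack": "Myocardial_infarction",
    "myocardial infarction": "Myocardial_infarction",
}


def bag_of_words(text):
    """Count lowercase word tokens (the classical BoW features)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def bag_of_concepts(text, lexicon):
    """Count concept mentions by matching known surface forms.

    Assumption: plain occurrence counts, not semantic-relevance weights.
    """
    lowered = text.lower()
    counts = Counter()
    for surface, concept in lexicon.items():
        pattern = r"\b" + re.escape(surface) + r"\b"
        hits = len(re.findall(pattern, lowered))
        if hits:
            counts[concept] += hits
    return counts


doc_a = "The patient suffered a heart attack."
doc_b = "Myocardial infarction was diagnosed."

# The two documents share no word features at all ...
shared_words = set(bag_of_words(doc_a)) & set(bag_of_words(doc_b))
# ... but map to the same concept feature.
boc_a = bag_of_concepts(doc_a, CONCEPTS)
boc_b = bag_of_concepts(doc_b, CONCEPTS)
```

In this toy case the synonymous phrases produce disjoint word features but identical concept features, which is the effect the bag-of-concepts representation is meant to exploit.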

