Biomedical literature classification using encyclopedic knowledge: a Wikipedia-based bag-of-concepts approach.

Authors

Mouriño García Marcos Antonio, Pérez Rodríguez Roberto, Anido Rifón Luis E

Affiliation

Department of Telematics Engineering, University of Vigo, Vigo, Spain.

Publication information

PeerJ. 2015 Sep 29;3:e1279. doi: 10.7717/peerj.1279. eCollection 2015.

Abstract

Automatic classification of text documents into a set of categories has many applications, among which the classification of biomedical literature stands out. Biomedical staff and researchers deal with large volumes of literature in their daily activities, so a system that gives simple and effective access to documents of interest would be useful; this requires that the documents be sorted according to some criteria, that is, classified. Documents to be classified are usually represented following the bag-of-words (BoW) paradigm: features are the words in the text, which therefore suffer from synonymy and polysemy, and their weights are based only on frequency of occurrence. This paper presents an empirical study of the efficiency of a classifier that leverages encyclopedic background knowledge, specifically Wikipedia, to create bag-of-concepts (BoC) representations of documents, where a concept is understood as a "unit of meaning", thereby tackling synonymy and polysemy. In addition, concepts are weighted by their semantic relevance in the text. The proposal was evaluated in empirical experiments on OHSUMED, one of the corpora commonly used for evaluating classification and retrieval of biomedical information, and on UVigoMED, a purpose-built corpus of MEDLINE biomedical abstracts. The results show that the Wikipedia-based bag-of-concepts representation outperforms the classical bag-of-words representation by up to 157% in the single-label classification problem and up to 100% in the multi-label problem on the OHSUMED corpus, and by up to 122% in the single-label problem and up to 155% in the multi-label problem on the UVigoMED corpus.
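
The contrast between the two representations can be illustrated with a toy sketch. It is not the authors' pipeline: the `CONCEPTS` table, the function names, and the count-based weights are all assumptions made for illustration. A real system would derive the mapping from Wikipedia and, as the abstract notes, weight concepts by semantic relevance rather than raw counts. The sketch shows only the synonymy side of the problem; resolving polysemy would need context-aware disambiguation, which is omitted here.

```python
import re
from collections import Counter

# Hypothetical surface-form -> concept table (an assumption for illustration).
# A real system would build this from Wikipedia, not from a hand-made dict.
CONCEPTS = {
    "heart attack": "Myocardial infarction",
    "myocardial infarction": "Myocardial infarction",
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def bag_of_words(text):
    """Classical BoW: each distinct word is a feature weighted by its count."""
    return Counter(tokenize(text))


def bag_of_concepts(text, concepts=CONCEPTS, max_len=2):
    """Toy BoC: greedily match the longest known phrase and count its concept.

    Tokens that match no known phrase are skipped. Counts stand in for the
    semantic-relevance weights described in the abstract.
    """
    tokens = tokenize(text)
    bag = Counter()
    i = 0
    while i < len(tokens):
        for n in range(max_len, 0, -1):
            phrase = " ".join(tokens[i:i + n])
            if phrase in concepts:
                bag[concepts[phrase]] += 1
                i += n
                break
        else:
            # No phrase starting at position i maps to a concept.
            i += 1
    return bag


doc_a = "Heart attack risk after surgery"
doc_b = "Myocardial infarction risk after surgery"

print(bag_of_words(doc_a) == bag_of_words(doc_b))        # the word features differ
print(bag_of_concepts(doc_a) == bag_of_concepts(doc_b))  # the concept features agree
```

Under BoW the two documents share only the words "risk", "after", and "surgery", while under the toy BoC both reduce to the same single concept, which is the effect on synonymy that the abstract describes.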
