A Bag of Concepts Approach for Biomedical Document Classification Using Wikipedia Knowledge*. Spanish-English Cross-language Case Study.

Author information

Mouriño-García Marcos A, Pérez-Rodríguez Roberto, Anido-Rifón Luis E

Affiliation

Department of Telematics Engineering, University of Vigo, Vigo, Spain

Publication information

Methods Inf Med. 2017 Oct 26;56(5):370-376. doi: 10.3414/ME17-01-0028. Epub 2017 Aug 16.

PMID:28816337

Abstract

OBJECTIVES

The ability to efficiently review the existing literature is essential for the rapid progress of research. This paper describes a classifier for text documents that are represented as vectors in spaces of Wikipedia concepts, and analyses its suitability for classifying Spanish biomedical documents when only English documents are available for training. We propose the cross-language concept matching (CLCM) technique, which relies on Wikipedia interlanguage links to convert concept vectors from the Spanish to the English space.
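The conversion step that CLCM performs can be illustrated with a minimal sketch. It assumes a document is a sparse mapping from Wikipedia article titles to weights and that interlanguage links are available as a Spanish-to-English lookup table. The table entries, the function name `convert_concept_vector`, and the choice to drop concepts without a link are all assumptions made for illustration, not details taken from the paper.

```python
from collections import defaultdict

# Hypothetical interlanguage-link table: Spanish Wikipedia title -> English title.
# Real links would come from Wikipedia's interlanguage data; these are illustrative.
INTERLANGUAGE_LINKS = {
    "Diabetes mellitus": "Diabetes mellitus",
    "Insulina": "Insulin",
    "Páncreas": "Pancreas",
}

def convert_concept_vector(spanish_vector, links):
    """Map a {Spanish concept: weight} vector into the English concept space.

    Assumptions of this sketch: concepts without an interlanguage link are
    dropped, and weights that land on the same English title are summed.
    """
    english_vector = defaultdict(float)
    for concept, weight in spanish_vector.items():
        target = links.get(concept)
        if target is not None:
            english_vector[target] += weight
    return dict(english_vector)

# "Glucemia" has no entry in the toy link table, so it is dropped.
spanish_doc = {"Insulina": 2.0, "Páncreas": 1.0, "Glucemia": 3.0}
english_doc = convert_concept_vector(spanish_doc, INTERLANGUAGE_LINKS)
print(english_doc)  # {'Insulin': 2.0, 'Pancreas': 1.0}
```

Once converted, the Spanish document lives in the same concept space as the English training documents, so a classifier trained only on English data can score it directly.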

METHODS

The performance of the classifier is compared to several baselines: a classifier based on machine translation, a classifier that represents documents after performing Explicit Semantic Analysis (ESA), and a classifier that uses a domain-specific semantic annotator (MetaMap). The corpus used for the experiments (Cross-Language UVigoMED) was purpose-built for this study, and it is composed of 12,832 English and 2,184 Spanish MEDLINE abstracts.

RESULTS

Our approach outperforms every other state-of-the-art classifier in the benchmark, with performance increases of up to 124% over classical machine translation, 332% over MetaMap, and 60 times over the ESA-based classifier. The results are statistically significant (p-values < 0.0001).

CONCLUSION

By using knowledge mined from Wikipedia to represent documents as vectors in a space of Wikipedia concepts, and by translating those vectors between language-specific concept spaces, a cross-language classifier can be built that performs better than several state-of-the-art classifiers.
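As a rough illustration of the classification step, the sketch below trains on English concept vectors and labels a document whose vector has already been converted into the English concept space. The abstract does not name the learning algorithm, so the nearest-centroid rule with cosine similarity used here is an assumption chosen for brevity. The category labels, concept names, and weights are hypothetical toy data.

```python
import math
from collections import defaultdict

def cosine(a, b):
    """Cosine similarity between two sparse {concept: weight} vectors."""
    dot = sum(w * b.get(c, 0.0) for c, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def train_centroids(labelled_docs):
    """Average the concept vectors of each category into one centroid."""
    sums = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(int)
    for vector, label in labelled_docs:
        counts[label] += 1
        for concept, weight in vector.items():
            sums[label][concept] += weight
    return {
        label: {c: w / counts[label] for c, w in vec.items()}
        for label, vec in sums.items()
    }

def classify(vector, centroids):
    """Return the category whose centroid is most similar to the vector."""
    return max(centroids, key=lambda label: cosine(vector, centroids[label]))

# Toy English training data (hypothetical concepts and categories).
english_training = [
    ({"Insulin": 2.0, "Pancreas": 1.0}, "endocrinology"),
    ({"Insulin": 1.0, "Blood sugar": 2.0}, "endocrinology"),
    ({"Myocardial infarction": 3.0, "Heart": 1.0}, "cardiology"),
]
centroids = train_centroids(english_training)

# A Spanish document after conversion into the English concept space.
converted_spanish_doc = {"Insulin": 2.0, "Pancreas": 1.0}
predicted = classify(converted_spanish_doc, centroids)
print(predicted)  # endocrinology
```

Because the training and test vectors share one concept space, no Spanish training data and no text-level machine translation is needed in this sketch.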
