Department of Telematics Engineering, University of Vigo, Campus Lagoas-Marcosende, 36310 Vigo, Spain.
Artif Intell Med. 2018 Jun;88:37-57. doi: 10.1016/j.artmed.2018.04.007. Epub 2018 May 3.
This article presents a classifier that leverages Wikipedia knowledge to represent documents as vectors of concept weights, and analyses its suitability for classifying biomedical documents written in any language when it is trained only with English documents. We propose the cross-language concept matching technique, which relies on Wikipedia interlanguage links to convert concept vectors between languages. The performance of the classifier is compared with that of a classifier based on machine translation and two classifiers based on MetaMap. To perform the experiments, we created two multilingual corpora. The first, Multi-Lingual UVigoMED (ML-UVigoMED), is composed of 23,647 Wikipedia documents about biomedical topics written in English, German, French, Spanish, Italian, Galician, Romanian, and Icelandic. The second, English-French-Spanish-German UVigoMED (EFSG-UVigoMED), is composed of 19,210 biomedical abstracts extracted from MEDLINE, written in English, French, Spanish, and German. The proposed approach outperforms every state-of-the-art classifier in the benchmark. We conclude that leveraging Wikipedia knowledge is of great advantage in the multilingual classification of biomedical documents.
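The idea of converting a concept vector between languages through interlanguage links can be illustrated with a minimal Python sketch. It is not the authors' implementation: the link table `ES_TO_EN`, the function `map_concept_vector`, the example weights, and the choices to drop unlinked concepts and to sum weights on collisions are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical interlanguage links: Spanish Wikipedia title -> English title.
# A real system would obtain these from Wikipedia's interlanguage link data.
ES_TO_EN = {
    "Insulina": "Insulin",
    "Páncreas": "Pancreas",
}


def map_concept_vector(vector, links):
    """Convert a {concept: weight} vector into the target language."""
    mapped = defaultdict(float)
    for concept, weight in vector.items():
        target = links.get(concept)
        if target is None:
            # Assumption: concepts without an interlanguage link are dropped.
            continue
        # Assumption: weights are summed if two concepts share a target.
        mapped[target] += weight
    return dict(mapped)


# A Spanish document represented as a vector of concept weights (made-up values).
spanish_doc = {"Insulina": 0.5, "Páncreas": 0.25, "Glucemia": 0.25}

english_vector = map_concept_vector(spanish_doc, ES_TO_EN)
print(english_vector)  # {'Insulin': 0.5, 'Pancreas': 0.25}
```

Once mapped, the vector uses the same concept space as the English training documents, so a classifier trained only on English data could in principle be applied to it directly.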