Trieschnigg Dolf, Pezik Piotr, Lee Vivian, de Jong Franciska, Kraaij Wessel, Rebholz-Schuhmann Dietrich
European Bioinformatics Institute, Hinxton, UK.
Bioinformatics. 2009 Jun 1;25(11):1412-8. doi: 10.1093/bioinformatics/btp249. Epub 2009 Apr 17.
Controlled vocabularies such as the Medical Subject Headings (MeSH) thesaurus and the Gene Ontology (GO) provide an efficient way of accessing and organizing biomedical information by reducing the ambiguity inherent to free-text data. Different methods of automating the assignment of MeSH concepts have been proposed to replace manual annotation, but they are either limited to a small subset of MeSH or have only been compared with a limited number of other systems.
We compare the performance of six MeSH classification systems [MetaMap, EAGL, a language model-based approach, a vector space model-based approach, a K-Nearest Neighbor (KNN) approach and MTI] in terms of reproducing and complementing manual MeSH annotations. A KNN system clearly outperforms the other published approaches and scales well to large amounts of text using the full MeSH thesaurus. Our measurements demonstrate to what extent manual MeSH annotations can be reproduced and how they can be complemented by automatic annotations. We also show that a statistically significant improvement can be obtained in information retrieval (IR) when the text of a user's query is automatically annotated with MeSH concepts, compared to using the original textual query alone.
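The general idea behind a KNN approach to MeSH assignment can be illustrated with a minimal sketch. This is not the system evaluated in the paper, whose implementation details are not given in this abstract. It assumes a plain term-frequency cosine similarity, a similarity-weighted vote over the MeSH terms of the k most similar annotated documents, and a tiny made-up corpus; the names `tf_vector`, `cosine`, `knn_mesh` and `TOY_CORPUS` are hypothetical.

```python
import math
from collections import Counter, defaultdict

# Made-up toy corpus of (text, manually assigned MeSH terms); illustration only.
TOY_CORPUS = [
    ("insulin therapy in patients with type 2 diabetes",
     ["Diabetes Mellitus, Type 2", "Insulin"]),
    ("gene expression profiling of breast cancer tumors",
     ["Breast Neoplasms", "Gene Expression Profiling"]),
    ("blood glucose control and insulin resistance",
     ["Insulin Resistance", "Blood Glucose"]),
]


def tf_vector(text):
    """Lowercased, whitespace-tokenized term-frequency vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def knn_mesh(text, annotated_docs, k=2, top_n=3):
    """Rank MeSH terms by a similarity-weighted vote of the k nearest documents."""
    query = tf_vector(text)
    # Score every annotated document against the input text, keep the k best.
    neighbours = sorted(
        ((cosine(query, tf_vector(doc)), terms) for doc, terms in annotated_docs),
        key=lambda pair: pair[0],
        reverse=True,
    )[:k]
    # Each neighbour votes for its own MeSH terms with its similarity score.
    votes = defaultdict(float)
    for similarity, terms in neighbours:
        for term in terms:
            votes[term] += similarity
    # Highest vote first; ties are broken alphabetically for determinism.
    return sorted(votes, key=lambda t: (-votes[t], t))[:top_n]


suggested = knn_mesh("insulin treatment for diabetes patients", TOY_CORPUS)
```

In this sketch the same function could annotate a user's query instead of a document, which is the setting of the IR experiment mentioned above. A realistic system would need an indexed collection of annotated citations rather than a linear scan over every document.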
The annotation of biomedical texts using controlled vocabularies such as MeSH can be automated to improve text-only IR. Furthermore, the automatic MeSH annotation system we propose is highly scalable and it generates improvements in IR comparable with those observed for manual annotations.