A multilingual gold-standard corpus for biomedical concept recognition: the Mantra GSC.

Author information

Kors Jan A, Clematide Simon, Akhondi Saber A, van Mulligen Erik M, Rebholz-Schuhmann Dietrich

Affiliations

Department of Medical Informatics, Erasmus University Medical Center, Rotterdam, The Netherlands.

Institute of Computational Linguistics, University of Zurich, Zurich, Switzerland.

Publication information

J Am Med Inform Assoc. 2015 Sep;22(5):948-56. doi: 10.1093/jamia/ocv037. Epub 2015 May 6.

Abstract

OBJECTIVE

To create a multilingual gold-standard corpus for biomedical concept recognition.

MATERIALS AND METHODS

We selected text units from different parallel corpora (Medline abstract titles, drug labels, biomedical patent claims) in English, French, German, Spanish, and Dutch. Three annotators per language independently annotated the biomedical concepts, based on a subset of the Unified Medical Language System and covering a wide range of semantic groups. To reduce the annotation workload, automatically generated preannotations were provided. Individual annotations were automatically harmonized and then adjudicated, and cross-language consistency checks were carried out to arrive at the final annotations.
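The harmonization step can be illustrated with a minimal sketch. The abstract does not say how individual annotations were combined, so the majority vote below (keep an annotation when at least two of the three annotators agree on both span and concept) is an assumption, as are the function name `harmonize`, the `(start, end, concept_id)` tuple layout, and the example spans and placeholder concept identifiers.

```python
from collections import Counter


def harmonize(annotation_sets, min_votes=2):
    """Return the annotations chosen by at least `min_votes` annotators.

    Each annotation is a hypothetical (start, end, concept_id) tuple.
    Majority voting is an assumed rule, not the procedure confirmed by the source.
    """
    votes = Counter()
    for annotations in annotation_sets:
        # set() ensures each annotator votes at most once per annotation
        votes.update(set(annotations))
    return {ann for ann, count in votes.items() if count >= min_votes}


# Made-up annotations from three annotators for one text unit
annotator_a = {(0, 8, "CUI_A"), (20, 29, "CUI_B")}
annotator_b = {(0, 8, "CUI_A"), (35, 42, "CUI_C")}
annotator_c = {(0, 8, "CUI_A"), (20, 29, "CUI_B")}

harmonized = harmonize([annotator_a, annotator_b, annotator_c])
```

In this sketch the annotation made by only one annotator is dropped, and the remaining set would then go to manual adjudication as described above.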

RESULTS

The number of final annotations was 5530. Inter-annotator agreement scores indicate good agreement (median F-score 0.79), and are similar to those between individual annotators and the gold standard. The automatically generated harmonized annotation set for each language performed as well as the best annotator for that language.
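As a rough illustration of how a median pairwise F-score can be computed, the sketch below compares every pair of annotators and takes the median. It assumes exact matching on span and concept, which the abstract does not specify; the function names and the toy annotation sets are hypothetical.

```python
from itertools import combinations
from statistics import median


def f_score(reference, candidate):
    """F-score (harmonic mean of precision and recall) between two annotation sets."""
    true_positives = len(reference & candidate)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(candidate)
    recall = true_positives / len(reference)
    return 2 * precision * recall / (precision + recall)


def median_pairwise_f(annotation_sets):
    """Median F-score over all unordered pairs of annotators."""
    scores = [f_score(a, b) for a, b in combinations(annotation_sets, 2)]
    return median(scores)


# Toy annotations: (start, end, concept_id) tuples with placeholder identifiers
ann_a = {(0, 4, "C1"), (5, 9, "C2"), (10, 14, "C3"), (15, 19, "C4")}
ann_b = {(0, 4, "C1"), (5, 9, "C2"), (10, 14, "C3"), (20, 24, "C5")}
ann_c = {(0, 4, "C1"), (5, 9, "C2"), (25, 29, "C6"), (30, 34, "C7")}

agreement = median_pairwise_f([ann_a, ann_b, ann_c])
```

Because swapping the two sets only swaps precision and recall, the F-score is symmetric, so no annotator has to be treated as the reference. The same `f_score` function can score an annotator, or an automatic system, against the gold standard.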

DISCUSSION

The use of automatic preannotations, harmonized annotations, and parallel corpora helped to keep the manual annotation efforts manageable. The inter-annotator agreement scores provide a reference standard for gauging the performance of automatic annotation techniques.

CONCLUSION

To our knowledge, this is the first gold-standard corpus for biomedical concept recognition in languages other than English. Other distinguishing features are the wide variety of semantic groups covered and the diversity of text genres annotated.
