Pooling annotated corpora for clinical concept extraction.

Author information

Wagholikar Kavishwar B, Torii Manabu, Jonnalagadda Siddhartha R, Liu Hongfang

Affiliation

Division of Biomedical Statistics and Informatics, Mayo Clinic, Rochester, MN, USA.

Publication information

J Biomed Semantics. 2013 Jan 8;4(1):3. doi: 10.1186/2041-1480-4-3.

Abstract

BACKGROUND

The availability of annotated corpora has facilitated the application of machine learning algorithms to concept extraction from clinical notes. However, creating the annotations is costly and labor-intensive. A potential alternative is to reuse existing corpora from other institutions by pooling them with local corpora to train machine taggers. In this paper we investigated this approach by pooling corpora from the 2010 i2b2/VA NLP challenge and Mayo Clinic Rochester and evaluating the resulting taggers on the recognition of medical problems. Both corpora were annotated for medical problems, but under different guidelines. The taggers were built with an existing tagging system, MedTagger, which consists of dictionary lookup, part-of-speech (POS) tagging, and machine learning for named entity prediction and concept extraction. We hope that this work will serve as a useful case study for facilitating the reuse of annotated corpora across institutions.

RESULTS

We found that pooling was effective when the size of the local corpus was small and after some of the guideline differences were reconciled. The benefits of pooling, however, diminished as more locally annotated documents were included in the training data. We examined the annotation guidelines to identify factors that determine the effect of pooling.
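The comparison described above, local-only training versus pooled training at increasing local corpus sizes, can be sketched roughly as follows. This is a minimal illustration and not the authors' experimental code: the function name `build_training_sets`, the document representation (lists of token/label pairs), and the `B-problem`/`I-problem` labels are all assumptions introduced here. Training and scoring a tagger on each set is left out.

```python
import random
from typing import Dict, List, Sequence, Tuple

# Hypothetical representation: a document is a list of (token, label) pairs.
Document = List[Tuple[str, str]]


def build_training_sets(
    local_docs: Sequence[Document],
    foreign_docs: Sequence[Document],
    local_sizes: Sequence[int],
    seed: int = 0,
) -> Dict[int, Tuple[List[Document], List[Document]]]:
    """For each local size n, return (local_only, pooled) training sets.

    local_only holds n locally annotated documents; pooled holds the same
    n local documents plus the whole foreign corpus.
    """
    rng = random.Random(seed)
    shuffled = list(local_docs)
    rng.shuffle(shuffled)  # fixed seed so the subsets are reproducible

    plans: Dict[int, Tuple[List[Document], List[Document]]] = {}
    for n in local_sizes:
        if n > len(shuffled):
            raise ValueError(f"requested {n} local documents, have {len(shuffled)}")
        local_only = shuffled[:n]  # nested subsets: smaller n is a prefix of larger n
        pooled = local_only + list(foreign_docs)
        plans[n] = (local_only, pooled)
    return plans


# Toy corpora standing in for the local and foreign annotated documents.
local = [[("chest", "B-problem"), ("pain", "I-problem")] for _ in range(10)]
foreign = [[("fever", "B-problem")] for _ in range(50)]

for n, (local_only, pooled) in build_training_sets(local, foreign, [2, 5, 10]).items():
    print(n, len(local_only), len(pooled))
```

Taking each local subset as a prefix of one shuffled list keeps the subsets nested, so a change in tagger performance between sizes reflects the added local documents rather than a different random sample.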

CONCLUSIONS

The effectiveness of pooling corpora depends on several factors, including the compatibility of annotation guidelines, the distribution of report types, and the sizes of the local and foreign corpora. Simple methods to rectify some of the guideline differences can facilitate pooling. Our findings need to be confirmed by further studies on different corpora. To facilitate the pooling and reuse of annotated corpora, we suggest that: (i) the NLP community should develop a standard annotation guideline that addresses the potential areas of guideline difference, which are partly identified in this paper; (ii) corpora should be annotated with a two-pass method that focuses first on concept recognition, followed by normalization to existing ontologies; and (iii) metadata, such as the type of report, should be created during the annotation process.
