Wagholikar Kavishwar, Torii Manabu, Jonnalagadda Siddhartha, Liu Hongfang
Mayo Clinic, Rochester, MN
AMIA Jt Summits Transl Sci Proc. 2012;2012:38. Epub 2012 Mar 19.
The availability of annotated corpora has facilitated the application of machine learning algorithms to concept extraction from clinical notes. However, preparing annotated corpora is expensive for individual institutions, and pooling annotated corpora from other institutions is a potential solution. In this paper we investigate whether pooling corpora from two different sources can improve the performance and portability of the resulting machine learning taggers for medical problem detection. Specifically, we pool corpora from the 2010 i2b2/VA NLP challenge and Mayo Clinic Rochester to evaluate taggers for recognition of medical problems. Contrary to our expectations, pooling the corpora is found to decrease the F1-score. We examine the annotation guidelines to identify the factors behind the incompatibility of the corpora, and we suggest that the clinical NLP community develop a standard annotation guideline to make annotated corpora compatible.
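For readers unfamiliar with the metric, the F1-score is the harmonic mean of precision and recall over the extracted concepts. The sketch below is illustrative only: the abstract does not state the matching criterion, so exact span matching is an assumption, and the `Span` type, the `f1_score` function, and the toy annotations are hypothetical rather than taken from the paper.

```python
from typing import Set, Tuple

# Hypothetical span representation: (document id, start token, end token).
Span = Tuple[str, int, int]


def f1_score(gold: Set[Span], predicted: Set[Span]) -> float:
    """Exact-match F1 over annotated medical problem spans (assumed criterion)."""
    true_positives = len(gold & predicted)
    if true_positives == 0:
        # Covers empty inputs and the case of no overlap at all.
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Toy annotations invented for illustration; not data from the study.
gold = {("doc1", 0, 2), ("doc1", 5, 6), ("doc2", 3, 4)}
predicted = {("doc1", 0, 2), ("doc2", 3, 5)}
score = f1_score(gold, predicted)
```

Under exact matching, a predicted span whose boundaries differ from the gold span counts as both a false positive and a false negative, which is one way that differing annotation guidelines between corpora could lower the score.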