Developing a corpus of clinical notes manually annotated for part-of-speech.

Authors

Pakhomov Serguei V, Coden Anni, Chute Christopher G

Affiliation

Division of Biomedical Informatics, Mayo College of Medicine, Rochester, MN 55905, USA.

Publication information

Int J Med Inform. 2006 Jun;75(6):418-29. doi: 10.1016/j.ijmedinf.2005.08.006. Epub 2005 Sep 19.

DOI:10.1016/j.ijmedinf.2005.08.006

PMID:16169769

Abstract

PURPOSE

This paper presents a project whose main goal is to construct a corpus of clinical text manually annotated for part-of-speech (POS) information. We describe and discuss the process of training three domain experts to perform linguistic annotation.

METHODS

Three domain experts were trained to perform manual annotation of a corpus of clinical notes. A part of this corpus was combined with the Penn Treebank corpus of general purpose English text and another part was set aside for testing. The corpora were then used for training and testing statistical part-of-speech taggers. We list some of the challenges as well as encouraging results pertaining to inter-rater agreement and consistency of annotation.
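The abstract does not say which statistics were used for inter-rater agreement or tagger evaluation. As a minimal sketch, assuming token-level accuracy and Cohen's kappa (a common choice, not confirmed by the source), the two measures could be computed as follows. The tag sequences are hypothetical toy data, not taken from the corpus.

```python
from collections import Counter


def token_accuracy(gold, predicted):
    """Fraction of tokens whose predicted tag equals the gold tag."""
    if len(gold) != len(predicted):
        raise ValueError("tag sequences must have the same length")
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


def cohen_kappa(tags_a, tags_b):
    """Chance-corrected agreement between two annotators on the same tokens."""
    if len(tags_a) != len(tags_b):
        raise ValueError("tag sequences must have the same length")
    n = len(tags_a)
    observed = sum(a == b for a, b in zip(tags_a, tags_b)) / n
    freq_a, freq_b = Counter(tags_a), Counter(tags_b)
    # Agreement expected if each annotator assigned tags independently
    # according to their own tag frequencies.
    expected = sum(freq_a[t] * freq_b[t] for t in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical tag throughout
    return (observed - expected) / (1 - expected)


# Hypothetical annotations of one six-token snippet (Penn Treebank style tags).
annotator_1 = ["NN", "VBD", "DT", "NN", "IN", "NN"]
annotator_2 = ["NN", "VBD", "DT", "JJ", "IN", "NN"]

accuracy = token_accuracy(annotator_1, annotator_2)
kappa = cohen_kappa(annotator_1, annotator_2)
print(f"raw agreement = {accuracy:.3f}, kappa = {kappa:.3f}")
```

Kappa is lower than raw agreement here because part of the overlap would be expected by chance given how often each annotator uses each tag.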

RESULTS

We used the Trigrams'n'Tags (TnT) [T. Brants, TnT-a statistical part-of-speech tagger, In: Proceedings of NAACL/ANLP-2000 Symposium, 2000] tagger trained on general English data to achieve 89.79% correctness. The same tagger trained on a portion of the medical data annotated for this project improved the performance to 94.69%. Furthermore, we find that discriminating between different types of discourse represented by different sections of clinical text may be very beneficial to improve correctness of POS tagging.
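To put the two reported figures in perspective, the improvement can be restated as a relative reduction in tagging errors. This is a small worked calculation using only the two percentages above; the function name is an illustrative choice, and the error-reduction figure itself is not stated in the abstract.

```python
def relative_error_reduction(baseline_acc, improved_acc):
    """Share of the baseline's errors eliminated; accuracies are percentages."""
    baseline_err = 100.0 - baseline_acc
    improved_err = 100.0 - improved_acc
    return (baseline_err - improved_err) / baseline_err


# Figures reported in the abstract: general-English training vs. added medical data.
reduction = relative_error_reduction(89.79, 94.69)
print(f"{reduction:.1%}")  # prints 48.0%
```

In other words, the error rate drops from 10.21% to 5.31%, so adding the in-domain training data removes roughly half of the baseline tagger's errors.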

CONCLUSION

Our preliminary experimental results indicate the necessity for adapting state-of-the-art POS taggers to the sublanguage domain of clinical text.

Similar articles

Developing a corpus of clinical notes manually annotated for part-of-speech.

Int J Med Inform. 2006 Jun;75(6):418-29. doi: 10.1016/j.ijmedinf.2005.08.006. Epub 2005 Sep 19.

Comparison of character-level and part of speech features for name recognition in biomedical texts.

J Biomed Inform. 2004 Dec;37(6):423-35. doi: 10.1016/j.jbi.2004.08.008.

Zone analysis in biology articles as a basis for information extraction.

Int J Med Inform. 2006 Jun;75(6):468-87. doi: 10.1016/j.ijmedinf.2005.06.013. Epub 2005 Aug 19.

Recognizing names in biomedical texts: a machine learning approach.

Bioinformatics. 2004 May 1;20(7):1178-90. doi: 10.1093/bioinformatics/bth060. Epub 2004 Feb 10.

Really, is medical sublanguage that different? Experimental counter-evidence from tagging medical and newspaper corpora.

Stud Health Technol Inform. 2004;107(Pt 1):560-4.

Evaluation of two dependency parsers on biomedical corpus targeted at protein-protein interactions.

Int J Med Inform. 2006 Jun;75(6):430-42. doi: 10.1016/j.ijmedinf.2005.06.009. Epub 2005 Aug 11.

Performance analysis of a POS tagger applied to discharge summaries in Portuguese.

Stud Health Technol Inform. 2010;160(Pt 2):959-63.

Distributed modules for text annotation and IE applied to the biomedical domain.

Int J Med Inform. 2006 Jun;75(6):496-500. doi: 10.1016/j.ijmedinf.2005.06.011. Epub 2005 Aug 8.

The interaction of domain knowledge and linguistic structure in natural language processing: interpreting hypernymic propositions in biomedical text.

J Biomed Inform. 2003 Dec;36(6):462-77. doi: 10.1016/j.jbi.2003.11.003.

Domain-specific language models and lexicons for tagging.

J Biomed Inform. 2005 Dec;38(6):422-30. doi: 10.1016/j.jbi.2005.02.009. Epub 2005 Apr 2.

Cited by

A comprehensive study of mobility functioning information in clinical notes: Entity hierarchy, corpus annotation, and sequence labeling.

Int J Med Inform. 2021 Mar;147:104351. doi: 10.1016/j.ijmedinf.2020.104351. Epub 2020 Dec 24.

Design of an extensive information representation scheme for clinical narratives.

J Biomed Semantics. 2017 Sep 11;8(1):37. doi: 10.1186/s13326-017-0135-z.

Improving performance of natural language processing part-of-speech tagging on clinical narratives through domain adaptation.

J Am Med Inform Assoc. 2013 Sep-Oct;20(5):931-9. doi: 10.1136/amiajnl-2012-001453. Epub 2013 Mar 13.

Part-of-speech tagging for clinical text: wall or bridge between institutions?

AMIA Annu Symp Proc. 2011;2011:382-91. Epub 2011 Oct 22.

Characteristics of Finnish and Swedish intensive care nursing narratives: a comparative analysis to support the development of clinical language technologies.

J Biomed Semantics. 2011;2 Suppl 3(Suppl 3):S1. doi: 10.1186/2041-1480-2-S3-S1. Epub 2011 Jul 14.

Quantitative analysis of ontology research articles in the radiologic domain.

Radiol Phys Technol. 2010 Jul;3(2):171-7. doi: 10.1007/s12194-010-0094-x. Epub 2010 May 22.

What can natural language processing do for clinical decision support?

J Biomed Inform. 2009 Oct;42(5):760-72. doi: 10.1016/j.jbi.2009.08.007. Epub 2009 Aug 13.

Agreement between patient-reported symptoms and their documentation in the medical record.

Am J Manag Care. 2008 Aug;14(8):530-9.
