Suppr超能文献

朝着临床叙述的全面句法和语义标注努力。

Towards comprehensive syntactic and semantic annotations of the clinical narrative.

机构信息

Department of Linguistics, University of Colorado, Boulder, Colorado, USA.

出版信息

J Am Med Inform Assoc. 2013 Sep-Oct;20(5):922-30. doi: 10.1136/amiajnl-2012-001317. Epub 2013 Jan 25.

Abstract

OBJECTIVE

To create annotated clinical narratives with layers of syntactic and semantic labels to facilitate advances in clinical natural language processing (NLP). To develop NLP algorithms and open source components.

METHODS

Manual annotation of a clinical narrative corpus of 127 606 tokens following the Treebank schema for syntactic information, PropBank schema for predicate-argument structures, and the Unified Medical Language System (UMLS) schema for semantic information. NLP components were developed.

RESULTS

The final corpus consists of 13 091 sentences containing 1772 distinct predicate lemmas. Of the 766 newly created PropBank frames, 74 are verbs. There are 28 539 named entity (NE) annotations spread over 15 UMLS semantic groups, one UMLS semantic type, and the Person semantic category. The most frequent annotations belong to the UMLS semantic groups of Procedures (15.71%), Disorders (14.74%), Concepts and Ideas (15.10%), Anatomy (12.80%), Chemicals and Drugs (7.49%), and the UMLS semantic type of Sign or Symptom (12.46%). Inter-annotator agreement results: Treebank (0.926), PropBank (0.891-0.931), NE (0.697-0.750). The part-of-speech tagger, constituency parser, dependency parser, and semantic role labeler are built from the corpus and released open source. A significant limitation uncovered by this project is the need for the NLP community to develop a widely agreed-upon schema for the annotation of clinical concepts and their relations.

CONCLUSIONS

This project takes a foundational step towards bringing the field of clinical NLP up to par with NLP in the general domain. The corpus creation and NLP components provide a resource for research and application development that would have been previously impossible.

摘要

目的

创建具有语法和语义标签层的注释临床叙述,以促进临床自然语言处理(NLP)的发展。开发 NLP 算法和开源组件。

方法

按照句法信息的 Treebank 模式、谓词-论元结构的 PropBank 模式和语义信息的统一医学语言系统(UMLS)模式,对 127606 个标记的临床叙述语料库进行手动注释。开发了 NLP 组件。

结果

最终语料库包含 13091 个句子,包含 1772 个不同的谓词词干。在新创建的 766 个 PropBank 框架中,有 74 个是动词。有 28539 个命名实体(NE)注释分布在 15 个 UMLS 语义组、一个 UMLS 语义类型和 Person 语义类别中。最常见的注释属于 UMLS 语义组:程序(15.71%)、疾病(14.74%)、概念和思想(15.10%)、解剖(12.80%)、化学物质和药物(7.49%)以及 UMLS 语义类型:症状或体征(12.46%)。注释者间一致性结果:Treebank(0.926)、PropBank(0.891-0.931)、NE(0.697-0.750)。词性标记器、短语结构解析器、依存解析器和语义角色标签器是从语料库中构建的,并作为开源发布。该项目揭示了一个重大限制,即 NLP 社区需要为临床概念及其关系的注释制定一个广泛认可的方案。

结论

该项目朝着使临床 NLP 领域与一般领域的 NLP 相媲美迈出了基础一步。语料库创建和 NLP 组件为研究和应用开发提供了资源,这在以前是不可能的。

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/6b8b/3756257/b1dfd5c0c20a/amiajnl-2012-001317f01.jpg

文献AI研究员

20分钟写一篇综述,助力文献阅读效率提升50倍。

立即体验

用中文搜PubMed

大模型驱动的PubMed中文搜索引擎

马上搜索

文档翻译

学术文献翻译模型,支持多种主流文档格式。

立即体验