CUILESS2016: a clinical corpus applying compositional normalization of text mentions.

Author information

Osborne John D, Neu Matthew B, Danila Maria I, Solorio Thamar, Bethard Steven J

Affiliations

University of Alabama at Birmingham, 7th Ave S, Birmingham, 1720, USA.

Computer Science Department, University of Houston, Houston, USA.

Publication information

J Biomed Semantics. 2018 Jan 10;9(1):2. doi: 10.1186/s13326-017-0173-6.

Abstract

BACKGROUND

Traditionally, text mention normalization corpora have normalized concepts to single ontology identifiers ("pre-coordinated concepts"). Less frequently, normalization corpora have used concepts with multiple identifiers ("post-coordinated concepts"), but the additional identifiers have been restricted to a defined set of relationships to the core concept. This approach limits the ability of the normalization process to express semantic meaning. We generated a freely available corpus using post-coordinated concepts without a defined set of relationships, which we term "compositional concepts", to evaluate their use in clinical text.

METHODS

We annotated 5397 disorder mentions from the ShARe corpus to SNOMED CT that were previously normalized as "CUI-less" in the "SemEval-2015 Task 14" shared task because they lacked a pre-coordinated mapping. Unlike the previous normalization method, we do not restrict concept mappings to a particular set of the Unified Medical Language System (UMLS) semantic types and allow normalization to occur to multiple UMLS Concept Unique Identifiers (CUIs). We computed annotator agreement and assessed semantic coverage with this method.
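As an illustration of the annotation scheme described above, the following is a minimal sketch of how a mention normalized to zero, one, or several CUIs might be represented, and how semantic coverage could be computed. The class name, field names, and `coverage` helper are hypothetical, and the CUI strings are placeholders rather than real UMLS identifiers; this is not the corpus's actual file format.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable


@dataclass(frozen=True)
class Annotation:
    """One disorder mention and the set of CUIs it was normalized to (hypothetical layout)."""
    mention: str
    cuis: FrozenSet[str]

    @property
    def is_cuiless(self) -> bool:
        # No identifier could be assigned to the mention.
        return len(self.cuis) == 0

    @property
    def is_compositional(self) -> bool:
        # More than one identifier is needed to express the meaning.
        return len(self.cuis) > 1


def coverage(annotations: Iterable[Annotation]) -> float:
    """Fraction of mentions that received at least one CUI."""
    items = list(annotations)
    if not items:
        return 0.0
    covered = sum(1 for a in items if not a.is_cuiless)
    return covered / len(items)


# Placeholder data for illustration only.
examples = [
    Annotation("mention one", frozenset({"C0000001"})),
    Annotation("mention two", frozenset({"C0000002", "C0000003"})),
    Annotation("mention three", frozenset()),
]
example_coverage = coverage(examples)
```

Keeping the CUIs in an unordered set reflects the absence of a defined set of relationships between the identifiers: the representation records which concepts apply, not how they relate to a core concept.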

RESULTS

We generated the largest clinical text normalization corpus to date with mappings to multiple identifiers and made it freely available. All but 8 of the 5397 disorder mentions were normalized using this methodology. Annotator agreement ranged from 52.4% using the strictest metric (exact matching) to 78.2% using a hierarchical agreement that measures the overlap of shared ancestral nodes.
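The two ends of the agreement range can be sketched as follows. The abstract does not give the formula for the hierarchical metric, so this example assumes a Jaccard overlap between the ancestor closures of the two annotators' concept sets; the toy hierarchy and all concept names are hypothetical and are not SNOMED CT content. For scale, the reported coverage corresponds to (5397 - 8) / 5397, or about 99.85% of mentions.

```python
from typing import Dict, FrozenSet, List, Sequence, Set

# Hypothetical toy is-a hierarchy (child -> parents).
PARENTS: Dict[str, List[str]] = {
    "ROOT": [],
    "A": ["ROOT"],
    "B": ["ROOT"],
    "A1": ["A"],
    "A2": ["A"],
    "B1": ["B"],
}


def ancestors(concept: str) -> Set[str]:
    """Return the concept together with all of its ancestors."""
    seen: Set[str] = set()
    stack = [concept]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(PARENTS[node])
    return seen


def closure(concepts: FrozenSet[str]) -> Set[str]:
    """Union of the ancestor sets of every concept in an annotation."""
    result: Set[str] = set()
    for concept in concepts:
        result |= ancestors(concept)
    return result


def exact_agreement(a: Sequence[FrozenSet[str]], b: Sequence[FrozenSet[str]]) -> float:
    """Share of mentions where both annotators chose the identical concept set."""
    return sum(1 for x, y in zip(a, b) if x == y) / len(a)


def hierarchical_agreement(a: Sequence[FrozenSet[str]], b: Sequence[FrozenSet[str]]) -> float:
    """Mean Jaccard overlap of ancestor closures (an assumed definition)."""
    scores = []
    for x, y in zip(a, b):
        cx, cy = closure(x), closure(y)
        union = cx | cy
        scores.append(len(cx & cy) / len(union) if union else 1.0)
    return sum(scores) / len(scores)


annotator_1 = [frozenset({"A1"}), frozenset({"A1", "B1"})]
annotator_2 = [frozenset({"A2"}), frozenset({"A1", "B1"})]

exact = exact_agreement(annotator_1, annotator_2)
hierarchical = hierarchical_agreement(annotator_1, annotator_2)
```

In this toy case the annotators disagree on the first mention but pick sibling concepts, so the hierarchical score gives partial credit where exact matching gives none, which is the kind of gap the reported 52.4% and 78.2% figures reflect.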

CONCLUSION

Our results provide evidence that compositional concepts can increase semantic coverage in clinical text. To our knowledge, this is the first freely available corpus of compositional concept annotation in clinical text.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/9b49/5761157/b08e42d9c93d/13326_2017_173_Fig1_HTML.jpg