Improving performance of natural language processing part-of-speech tagging on clinical narratives through domain adaptation.

Affiliation

Department of Biomedical Informatics, University of Utah, Salt Lake City, Utah, USA.

Publication information

J Am Med Inform Assoc. 2013 Sep-Oct;20(5):931-9. doi: 10.1136/amiajnl-2012-001453. Epub 2013 Mar 13.

Abstract

OBJECTIVE

Natural language processing (NLP) tasks are commonly decomposed into subtasks that are chained together to form processing pipelines. The residual error produced in these subtasks propagates, adversely affecting the end objectives. Limited availability of annotated clinical data remains a barrier to reaching state-of-the-art operating characteristics with statistically based NLP tools in the clinical domain. Here we explore the unique linguistic constructions of clinical texts and demonstrate the loss in operating characteristics when out-of-the-box part-of-speech (POS) tagging tools are applied to the clinical domain. We test a domain adaptation approach that integrates a novel lexical-generation probability rule, used in a transformation-based learner, to boost POS performance on clinical narratives.

METHODS

Two target corpora from independent healthcare institutions were constructed from high-frequency clinical narratives. Four leading POS taggers, with their out-of-the-box models trained on general English and biomedical abstracts, were evaluated against these clinical corpora. A high-performing domain adaptation method, Easy Adapt, was compared with our newly proposed method, ClinAdapt.
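Easy Adapt is generally described as a feature-augmentation scheme: every feature is copied into a shared version and a domain-specific version, so that a downstream learner can weight general and domain-specific evidence separately. The following is a minimal sketch of that idea only, not the configuration used in this study. The function name `easy_adapt`, the feature names, and the domain labels `source` and `target` are illustrative assumptions.

```python
from typing import Dict


def easy_adapt(features: Dict[str, float], domain: str) -> Dict[str, float]:
    """Return a shared copy and a domain-specific copy of every feature.

    `domain` is a hypothetical label, e.g. "source" for general English
    training data and "target" for annotated clinical text.
    """
    augmented: Dict[str, float] = {}
    for name, value in features.items():
        augmented[f"general:{name}"] = value   # copy shared by all domains
        augmented[f"{domain}:{name}"] = value  # copy seen only in this domain
    return augmented


# Hypothetical token features from each domain.
source_token = easy_adapt({"word=discharge": 1.0, "suffix=ge": 1.0}, "source")
target_token = easy_adapt({"word=discharge": 1.0, "prev=pt": 1.0}, "target")

# Only the shared copies can overlap between domains.
shared_keys = set(source_token) & set(target_token)
```

Because the domain-specific copies never collide, a classifier trained on the combined data can learn one weight for how "discharge" behaves in general and a separate correction for how it behaves in clinical text.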

RESULTS

The evaluated POS taggers drop in accuracy by 8.5-15% when tested on clinical narratives. The highest-performing tagger reports an accuracy of 88.6%. Domain adaptation with Easy Adapt reports accuracies of 88.3-91.0% on clinical texts. ClinAdapt reports 93.2-93.9%.
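POS tagging accuracy is conventionally the fraction of tokens whose predicted tag matches the gold-standard tag. The sketch below assumes that token-level definition; the abstract does not spell out the exact scoring procedure. The function name `token_accuracy` and the toy tag sequences are hypothetical and are not data from the study.

```python
from typing import Sequence


def token_accuracy(gold: Sequence[str], predicted: Sequence[str]) -> float:
    """Fraction of tokens whose predicted tag equals the gold tag."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        raise ValueError("cannot score an empty sequence")
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)


# Toy example: one of five tokens is mis-tagged (JJ predicted as NN).
gold_tags = ["NN", "VBZ", "JJ", "NN", "."]
predicted_tags = ["NN", "VBZ", "NN", "NN", "."]
accuracy = token_accuracy(gold_tags, predicted_tags)  # 4 of 5 tokens correct
```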

CONCLUSIONS

ClinAdapt successfully boosts POS tagging performance through domain adaptation that requires only a modest amount of annotated clinical data. Improving the performance of critical NLP subtasks is expected to reduce pipeline error propagation, leading to better overall results on complex processing tasks.
