Improving performance of natural language processing part-of-speech tagging on clinical narratives through domain adaptation.

Author information

Department of Biomedical Informatics, University of Utah, Salt Lake City, Utah, USA.

Publication information

J Am Med Inform Assoc. 2013 Sep-Oct;20(5):931-9. doi: 10.1136/amiajnl-2012-001453. Epub 2013 Mar 13.

DOI:10.1136/amiajnl-2012-001453

PMID:23486109

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC3756264/

Abstract

OBJECTIVE

Natural language processing (NLP) tasks are commonly decomposed into subtasks, chained together to form processing pipelines. The residual error produced in these subtasks propagates, adversely affecting the end objectives. Limited availability of annotated clinical data remains a barrier to reaching state-of-the-art operating characteristics using statistically based NLP tools in the clinical domain. Here we explore the unique linguistic constructions of clinical texts and demonstrate the loss in operating characteristics when out-of-the-box part-of-speech (POS) tagging tools are applied to the clinical domain. We test a domain adaptation approach integrating a novel lexical-generation probability rule used in a transformation-based learner to boost POS performance on clinical narratives.

METHODS

Two target corpora from independent healthcare institutions were constructed from high frequency clinical narratives. Four leading POS taggers with their out-of-the-box models trained from general English and biomedical abstracts were evaluated against these clinical corpora. A high performing domain adaptation method, Easy Adapt, was compared to our newly proposed method ClinAdapt.
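For readers unfamiliar with the baseline, Easy Adapt is generally described as a feature-augmentation scheme: every feature is duplicated into a shared copy and a domain-specific copy, so a learner can weight general and domain-specific evidence separately. The sketch below illustrates that idea only. The abstract does not say how Easy Adapt was configured in this study, and the function name, feature names, and domain labels here are hypothetical.

```python
from typing import Dict


def easy_adapt(features: Dict[str, float], domain: str) -> Dict[str, float]:
    """Return an augmented sparse feature dict.

    Each input feature is emitted twice: once under a shared "general"
    prefix and once under a prefix naming the domain of the example.
    The prefixes are illustrative assumptions, not taken from the paper.
    """
    augmented: Dict[str, float] = {}
    for name, value in features.items():
        augmented[f"general:{name}"] = value   # copy shared by all domains
        augmented[f"{domain}:{name}"] = value  # copy visible to one domain only
    return augmented


# Hypothetical token features for the same word seen in two domains.
token_features = {"word=discharge": 1.0, "suffix=ge": 1.0}
source_token = easy_adapt(token_features, "source")  # e.g. general English
target_token = easy_adapt(token_features, "target")  # e.g. clinical narrative

# Only the "general:" copies overlap between the two domains.
shared_keys = set(source_token) & set(target_token)
```

After augmentation, the source and target examples share only the `general:` features, which is what allows a standard classifier trained on the combined data to learn where the domains agree and where the target domain differs.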

RESULTS

The evaluated POS taggers drop in accuracy by 8.5-15% when tested on clinical narratives. The highest performing tagger reports an accuracy of 88.6%. Domain adaptation with Easy Adapt reports accuracies of 88.3-91.0% on clinical texts. ClinAdapt reports 93.2-93.9%.

CONCLUSIONS

ClinAdapt successfully boosts POS tagging performance through domain adaptation requiring a modest amount of annotated clinical data. Improving the performance of critical NLP subtasks is expected to reduce pipeline error propagation leading to better overall results on complex processing tasks.
