Feasibility of pooling annotated corpora for clinical concept extraction.

Authors

Wagholikar Kavishwar, Torii Manabu, Jonnalagadda Siddhartha, Liu Hongfang

Affiliation

Mayo Clinic, Rochester, MN

Publication

AMIA Jt Summits Transl Sci Proc. 2012;2012:38. Epub 2012 Mar 19.

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC3392069/

Abstract

The availability of annotated corpora has facilitated the application of machine learning algorithms to concept extraction from clinical notes. However, preparing annotated corpora at individual institutions is expensive, and pooling annotated corpora from other institutions is a potential solution. In this paper we investigate whether pooling corpora from two different sources can improve the performance and portability of the resulting machine learning taggers for medical problem detection. Specifically, we pool corpora from the 2010 i2b2/VA NLP challenge and Mayo Clinic Rochester to evaluate taggers for recognition of medical problems. Contrary to our expectations, pooling the corpora is found to decrease the F1-score. We examine the annotation guidelines to identify the factors that make the corpora incompatible, and we suggest that the clinical NLP community develop a standard annotation guideline so that annotated corpora are compatible.
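As a rough illustration of the metric named in the abstract, the sketch below computes precision, recall, and F1 over sets of concept spans. It assumes exact-match scoring, which the abstract does not specify; the `Span` type, the `prf1` helper, and all span values are hypothetical toy data, not results from the paper. The toy "pooled" prediction differs from the gold standard only in one span boundary, the kind of mismatch that differing annotation guidelines could plausibly produce.

```python
from typing import Set, Tuple

# Hypothetical span representation: (document id, start token, end token).
Span = Tuple[str, int, int]


def prf1(gold: Set[Span], predicted: Set[Span]) -> Tuple[float, float, float]:
    """Return exact-match (precision, recall, F1) for two sets of spans."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy gold annotations for a local test set (illustrative only).
gold = {("doc1", 3, 5), ("doc1", 10, 11), ("doc2", 0, 2)}

# Toy output of a tagger trained on local data only.
local_only = {("doc1", 3, 5), ("doc1", 10, 11), ("doc2", 7, 8)}

# Toy output of a tagger trained on pooled data: one span now ends a token
# later, so under exact matching it no longer counts as correct.
pooled = {("doc1", 3, 6), ("doc1", 10, 11), ("doc2", 7, 8)}

for name, predicted in [("local only", local_only), ("pooled", pooled)]:
    p, r, f = prf1(gold, predicted)
    print(f"{name}: P={p:.2f} R={r:.2f} F1={f:.2f}")
```

Under exact matching, a single shifted boundary costs both a true positive and a correct prediction, so precision and recall fall together. A relaxed (overlap-based) matcher would score the same toy output more leniently.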
