Developing a manually annotated clinical document corpus to identify phenotypic information for inflammatory bowel disease.

Affiliation

VA Salt Lake City Health Care System, IDEAS Center, Salt Lake City, Utah 84148, USA.

Publication information

BMC Bioinformatics. 2009 Sep 17;10 Suppl 9(Suppl 9):S12. doi: 10.1186/1471-2105-10-S9-S12.

Abstract

BACKGROUND

Natural Language Processing (NLP) systems can be used for specific Information Extraction (IE) tasks such as extracting phenotypic data from the electronic medical record (EMR). These data are useful for translational research and are often found only in free-text clinical notes. A key step for IE is the manual annotation of clinical corpora and the creation of a reference standard that (1) supports training and validation tasks and (2) focuses and clarifies NLP system requirements. These tasks are time-consuming and expensive, and they require considerable effort from human reviewers.

METHODS

Using a set of clinical documents from the VA EMR for a particular use case of interest, we identify specific challenges and present several opportunities for annotation tasks. We demonstrate specific methods using an open source annotation tool, a customized annotation schema, and a corpus of clinical documents for patients known to have a diagnosis of Inflammatory Bowel Disease (IBD). We report clinician annotator agreement at the document, concept, and concept attribute levels. We estimate concept yield as the number of annotated concepts within specific note sections and document types.
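As a rough illustration of the concept yield estimate, the sketch below counts annotated concepts per document for each document type. It assumes yield means concepts divided by the number of documents of that type. The function name, the data layout, and the toy records are hypothetical and are not taken from the study.

```python
from collections import Counter

def concept_yield_per_document(documents, annotations):
    """Average number of annotated concepts per document, by document type.

    documents:   dict mapping doc_id -> document type (hypothetical layout)
    annotations: list of (doc_id, note_section, concept) tuples
    """
    docs_per_type = Counter(documents.values())
    concepts_per_type = Counter(documents[doc_id] for doc_id, _, _ in annotations)
    # Documents with no annotations still count in the denominator.
    return {t: concepts_per_type[t] / n for t, n in docs_per_type.items()}

# Toy data for illustration only; not from the annotated corpus.
documents = {
    "d1": "GI clinic note",
    "d2": "GI clinic note",
    "d3": "radiology report",
}
annotations = [
    ("d1", "assessment", "abdominal pain"),
    ("d1", "history of presenting illness", "diarrhea"),
    ("d2", "assessment", "Crohn's disease"),
    ("d3", "impression", "colitis"),
]

yields = concept_yield_per_document(documents, annotations)
for doc_type, value in yields.items():
    print(f"{doc_type}: {value:.2f} concepts per document")
```

Grouping by the note section instead of the document type would give the per-section yield in the same way.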

RESULTS

Document-level annotator agreement on whether a document contained concepts of interest for IBD was very high, with an estimated Kappa statistic (95% CI) of 0.87 (0.82, 0.93). At the concept level, F-measure ranged from 0.61 to 0.83. However, agreement varied greatly at the specific concept attribute level. For this particular use case (IBD), clinical documents producing the highest concept yield per document included GI clinic notes and primary care notes. Within the various types of notes, the highest concept yield was in sections representing patient assessment and history of presenting illness. Ancillary service documents and the family history and plan note sections produced the lowest concept yield.
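For readers unfamiliar with the two agreement measures, here is a minimal sketch of how they are commonly computed for a pair of annotators. It assumes Cohen's kappa over binary document-level labels and an exact-span-match F-measure with one annotator treated as the reference. The abstract does not specify either choice, and the labels and spans below are made up.

```python
def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    n = len(labels_a)
    observed = sum(x == y for x, y in zip(labels_a, labels_b)) / n
    categories = set(labels_a) | set(labels_b)
    # Chance agreement from each annotator's marginal label frequencies.
    expected = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n) for c in categories
    )
    return (observed - expected) / (1 - expected)

def f_measure(reference_spans, other_spans):
    """F-measure over exactly matching spans, with one annotator as reference."""
    matched = len(reference_spans & other_spans)
    if matched == 0:
        return 0.0
    precision = matched / len(other_spans)
    recall = matched / len(reference_spans)
    return 2 * precision * recall / (precision + recall)

# Toy document-level labels: 1 = document contains a concept of interest.
annotator_a = [1, 1, 0, 0, 1, 0, 1, 1, 0, 0]
annotator_b = [1, 1, 0, 0, 1, 0, 1, 0, 0, 0]
kappa = cohens_kappa(annotator_a, annotator_b)

# Toy concept spans as (start, end) character offsets within one document.
spans_a = {(0, 5), (10, 18), (30, 42)}
spans_b = {(0, 5), (10, 18), (50, 55), (60, 70)}
f1 = f_measure(spans_a, spans_b)

print(f"kappa = {kappa:.2f}, F-measure = {f1:.2f}")
```

F-measure is often preferred at the concept level because the number of non-annotated spans, which kappa needs to estimate chance agreement, is not well defined for free text.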

CONCLUSION

Challenges include defining and building appropriate annotation schemas, adequately training clinician annotators, and determining the appropriate level of information to be annotated. Opportunities include narrowing the focus of information extraction to use-case-specific note types and sections, especially where NLP systems will be used to extract information from large repositories of electronic clinical notes.
