Ad Hoc Information Extraction for Clinical Data Warehouses.

Authors

Dietrich Georg, Krebs Jonathan, Fette Georg, Ertl Maximilian, Kaspar Mathias, Störk Stefan, Puppe Frank

Publication information

Methods Inf Med. 2018 May;57(1):e22-e29. doi: 10.3414/ME17-02-0010. Epub 2018 May 25.

Abstract

BACKGROUND

Clinical Data Warehouses (CDW) reuse electronic health records (EHR) to make their data retrievable for research purposes or for patient recruitment for clinical trials. However, much information is hidden in unstructured data such as discharge letters. These texts can be preprocessed and converted to structured data via information extraction (IE), which is unfortunately a laborious task and therefore usually not available for most of the text data in a CDW.

OBJECTIVES

The goal of our work is to provide an ad hoc IE service that allows users to query text data in a manner similar to querying structured data in a CDW. While search engines just return text snippets, our system also returns frequencies (e.g. how many patients exist with "heart failure", including textual synonyms, or how many patients have an LVEF < 45) based on the content of discharge letters or textual reports for special investigations such as echocardiography. Three subtasks are addressed: (1) to recognize and exclude negations and their scopes, (2) to extract concepts, i.e. Boolean values, and (3) to extract numerical values.
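As a rough illustration of what returning frequencies rather than snippets could look like, the following minimal Python sketch counts patients for one Boolean concept and one numerical concept. The toy reports, the synonym list, the `LVEF` surface pattern and the function names are all assumptions made for this example, not part of PaDaWaN, and negation handling is deliberately left out here.

```python
import re

# Hypothetical toy corpus: patient id -> report text (invented for illustration).
REPORTS = {
    "p1": "Known heart failure. LVEF 35 %.",
    "p2": "Cardiac insufficiency, LVEF: 40%.",
    "p3": "Normal systolic function, LVEF 60 %.",
}

# User-supplied variants for one Boolean concept (assumed synonym list).
HEART_FAILURE = re.compile(r"heart failure|cardiac insufficiency", re.IGNORECASE)

# Assumed surface pattern for a numeric concept: "LVEF", optional separator, a number.
LVEF = re.compile(r"LVEF\s*[:=]?\s*(\d+(?:\.\d+)?)", re.IGNORECASE)


def count_boolean(reports, pattern):
    """Number of patients whose text mentions the concept at least once."""
    return sum(1 for text in reports.values() if pattern.search(text))


def count_numeric_below(reports, pattern, threshold):
    """Number of patients with at least one extracted value below the threshold."""
    count = 0
    for text in reports.values():
        values = [float(m.group(1)) for m in pattern.finditer(text)]
        if any(v < threshold for v in values):
            count += 1
    return count


if __name__ == "__main__":
    print(count_boolean(REPORTS, HEART_FAILURE))
    print(count_numeric_below(REPORTS, LVEF, 45))
```

Because the patterns are supplied at query time, changing the synonym list or the threshold only requires re-running the query, which is the sense in which such an extraction is ad hoc.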

METHODS

We implemented an extended version of the NegEx algorithm for German texts that detects negations and determines their scope. Furthermore, our document-oriented CDW PaDaWaN was extended with query functions, e.g. context-sensitive queries and regex queries, and with an extraction mode for computing the frequencies of Boolean and numerical values.
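The general idea behind NegEx-style negation handling (a trigger phrase opens a negated scope that ends at a terminating word or after a fixed window) can be sketched as follows. This is a heavily simplified illustration in English, not the authors' extended German implementation: the trigger list, the terminator list, the window size `MAX_SCOPE` and all function names are assumptions.

```python
import re

# Illustrative trigger lists (assumptions; not the lexicon used by the authors).
PRE_NEGATION = ["no evidence of", "no", "without"]
TERMINATORS = ["but", "however"]
MAX_SCOPE = 5  # assumed maximum number of tokens a negation covers


def match_trigger(tokens, start):
    """Length in tokens of the longest trigger starting at `start`, or 0."""
    best = 0
    for trigger in PRE_NEGATION:
        parts = trigger.split()
        if tokens[start:start + len(parts)] == parts:
            best = max(best, len(parts))
    return best


def remove_negated_spans(sentence):
    """Return the sentence's tokens with negation triggers and their scope dropped."""
    tokens = re.findall(r"\w+", sentence.lower())
    kept = []
    i = 0
    while i < len(tokens):
        trigger_len = match_trigger(tokens, i)
        if trigger_len == 0:
            kept.append(tokens[i])
            i += 1
            continue
        # Skip the trigger, then skip scope tokens until a terminator or the window ends.
        i += trigger_len
        end = min(i + MAX_SCOPE, len(tokens))
        while i < end and tokens[i] not in TERMINATORS:
            i += 1
    return kept


def is_affirmed(sentence, concept):
    """True if the (single-token) concept survives negation removal."""
    return concept.lower() in remove_negated_spans(sentence)
```

For example, in "No pleural effusion, but cardiomegaly is present." the sketch drops "pleural effusion" as negated and keeps "cardiomegaly", so a subsequent concept or regex query would only see the affirmed finding.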

RESULTS

Evaluations on chest X-ray reports and discharge letters showed high F1-scores for the three subtasks: detection of negated concepts reached an F1-score of 0.99 in chest X-ray reports and 0.97 in discharge letters; extraction of Boolean values reached about 0.99 in chest X-ray reports; and extraction of numerical values reached around 0.99 in both chest X-ray reports and discharge letters, with the exception of the concept age.
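For readers less familiar with the metric, the F1-score is conventionally the harmonic mean of precision and recall. The abstract does not spell out its formula, so the standard definition is assumed here, with TP, FP and FN denoting true positives, false positives and false negatives:

```latex
P = \frac{TP}{TP + FP}, \qquad
R = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2\,P\,R}{P + R}
```

Under this definition, an F1-score of 0.99 requires both precision and recall to be close to 1.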

DISCUSSION

The advantages of ad hoc IE over standard IE are the low development effort (the user just enters the concept with its variants), the promptness of the results, and the ability of users to adapt it to their particular question. The disadvantages are usually lower accuracy and confidence. This ad hoc information extraction approach is novel and goes beyond existing systems: Roogle [1] extracts predefined concepts from texts during preprocessing and makes them retrievable at runtime. Dr. Warehouse [2] applies negation detection and indexes the resulting subtexts, which contain affirmed findings. Our approach combines negation detection with concept extraction, but the extraction takes place at runtime rather than during preprocessing. This provides ad hoc, dynamic, interactive and adjustable extraction of arbitrary concepts, and even their values, on the fly.

CONCLUSIONS

We developed an ad hoc information extraction query feature for Boolean and numerical values within a CDW with high recall and precision based on a pipeline that detects and removes negations and their scope in clinical texts.
