Semi-Automatic Terminology Generation for Information Extraction from German Chest X-Ray Reports.

Author information

Krebs Jonathan, Corovic Hamo, Dietrich Georg, Ertl Max, Fette Georg, Kaspar Mathias, Krug Markus, Stoerk Stefan, Puppe Frank

Affiliations

Würzburg University, Chair of Computer Science 6, Germany.

Bamberg Hospital, Department of Radiology, Germany.

Publication information

Stud Health Technol Inform. 2017;243:80-84.

PMID:28883175

Abstract

Extraction of structured data from textual reports is an important subtask in building medical data warehouses for research and care. Many medical reports, and most radiology reports, are written in a telegraphic style: a concatenation of noun phrases describing the presence or absence of findings. A lexico-syntactic approach is therefore promising, in which key terms and their relations are recognized and mapped to a predefined standard terminology (ontology). We propose a two-phase algorithm for terminology matching. In the first phase, a local terminology for recognition is derived that stays as close as possible to the terms used in the radiology reports. In the second phase, the local terminology is mapped to a standard terminology. In this paper, we report on an algorithm for the first phase, the semi-automatic generation of the local terminology, and evaluate it on radiology reports of chest X-ray examinations from Würzburg University Hospital. With about 20 hours of work by a radiologist acting as domain expert and 10 hours of meetings, a local terminology with about 250 attributes and various value patterns was built. In an evaluation on 100 randomly chosen reports, it achieved an F1 score of about 95% for information extraction.
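
To make the idea concrete, the following is a minimal sketch of lexico-syntactic matching against a local terminology, together with the F1 computation used to score extraction. It is not the authors' implementation: the three attributes, their regular expressions, the negation cue list, and the helper names `extract` and `f1_score` are all assumptions chosen for illustration, and negation is assumed to apply only within a punctuation-delimited phrase.

```python
import re

# Hypothetical local terminology: attribute name -> pattern over report wording.
# These three entries are illustrative and not taken from the paper.
LOCAL_TERMINOLOGY = {
    "pleural_effusion": re.compile(r"pleuraerg(?:uss|üsse)", re.IGNORECASE),
    "pneumothorax": re.compile(r"pneumothorax", re.IGNORECASE),
    "infiltrate": re.compile(r"infiltrat", re.IGNORECASE),
}

# Assumed negation cues; a cue negates every term in the same phrase.
NEGATION = re.compile(r"\b(?:kein|keine|keinen|ohne)\b", re.IGNORECASE)


def extract(report):
    """Return {attribute: 'present' | 'absent'} for one telegraphic report."""
    findings = {}
    # Split the report into phrases at sentence and clause punctuation.
    for phrase in re.split(r"[.;,]", report):
        negated = NEGATION.search(phrase) is not None
        for attribute, pattern in LOCAL_TERMINOLOGY.items():
            if pattern.search(phrase):
                findings[attribute] = "absent" if negated else "present"
    return findings


def f1_score(predicted, gold):
    """Micro-averaged F1 over (report index, attribute, value) triples."""
    pred = {(i, a, v) for i, d in enumerate(predicted) for a, v in d.items()}
    ref = {(i, a, v) for i, d in enumerate(gold) for a, v in d.items()}
    true_positives = len(pred & ref)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(ref)
    return 2 * precision * recall / (precision + recall)


report = "Kein Pneumothorax. Pleuraerguss rechts, Infiltrat im linken Unterlappen."
result = extract(report)
```

Here `result` maps `pneumothorax` to `absent` and the other two attributes to `present`. Splitting at commas is a deliberate simplification: a negation such as "Kein Pneumothorax, Erguss oder Infiltrat" would need scope handling that this sketch does not attempt.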
