[A customized method for information extraction from unstructured text data in the electronic medical records].

Authors

Bao X Y, Huang W J, Zhang K, Jin M, Li Y, Niu C Z

Affiliations

Medical Informatics Center, Peking University, Beijing 100191, China; National Clinical Service Data Center, Beijing 100191, China.

School of Mathematical Sciences, Peking University, Beijing 100871, China.

Publication information

Beijing Da Xue Xue Bao Yi Xue Ban. 2018 Apr 18;50(2):256-263.

Abstract

OBJECTIVE

Electronic medical records (EMRs) contain a huge amount of diagnostic and treatment information, a concrete record of clinicians' actual diagnosis and treatment details. Many sections of an EMR, such as the chief complaint, history of present illness, past history, differential diagnosis, diagnostic imaging, and surgical records, reflect the details of diagnosis and treatment in the clinical process and are written as natural-language Chinese narrative. Extracting useful information from this Chinese narrative text and organizing it into tabular form for medical research analysis, so that real-world clinical data can be put to practical use, is a difficult problem in Chinese medical data processing.

METHODS

Based on EMR narrative text data from a tertiary hospital in China, we propose a method that learns customized information extraction rules and then performs rule-based information extraction. The method consists of three steps. (1) A random sample of 600 EMR documents (including the history of present illness, past history, personal history, family history, etc.) was drawn as the raw corpus. Using a Chinese clinical narrative text annotation platform that we developed, trained clinicians and nurses marked the tokens and phrases in the corpus that were to be extracted (taking history of diabetes as the example). (2) Extraction templates were first summarized and induced from the annotated corpus. These templates were then rewritten as regular expressions in the Perl programming language to serve as extraction rules. With these rules as the basic knowledge base, we developed extraction packages in Perl to extract data from the EMR text. Finally, the extracted data items were organized in tabular format for later use in clinical research or hospital surveillance. (3) The proposed method was evaluated and validated on the National Clinical Service Data Integration Platform, where the extraction results were checked by a combination of manual and automated verification, which demonstrated the effectiveness of the method.
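Step 2 can be illustrated with a minimal sketch. The abstract does not give the actual Perl templates, so the three regular expressions, the field names, and the sample notes below are hypothetical, and Python is used here in place of Perl purely for illustration. The sketch shows the general idea of rule-based extraction: each rule is a regular expression, and each narrative note is turned into one row of a table.

```python
import csv
import io
import re

# Hypothetical extraction rules; the real templates in the study are not
# given in the abstract. Each rule stays within one clause by excluding
# clause-ending punctuation (full-width and ASCII commas, periods, semicolons).
NEGATION_RULE = re.compile(r"(否认|无)[^,。;,;]*糖尿病")       # "denies ... diabetes"
DURATION_RULE = re.compile(r"糖尿病[^,。;,;]*?(\d+)\s*(年|月)")  # "diabetes ... N years/months"
MENTION_RULE = re.compile(r"糖尿病")                             # any mention of diabetes

FIELDS = ["diabetes_history", "duration", "unit"]


def extract_diabetes_history(text):
    """Apply the rules to one narrative note and return one tabular row."""
    row = {"diabetes_history": "not_mentioned", "duration": "", "unit": ""}
    if NEGATION_RULE.search(text):
        row["diabetes_history"] = "denied"
        return row
    if MENTION_RULE.search(text):
        row["diabetes_history"] = "present"
        match = DURATION_RULE.search(text)
        if match:
            row["duration"] = match.group(1)
            row["unit"] = "year" if match.group(2) == "年" else "month"
    return row


def to_csv(rows):
    """Serialize extracted rows as CSV text, one row per note."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up example notes, not taken from the study's corpus.
notes = [
    "既往糖尿病病史10年,口服二甲双胍治疗。",  # 10-year history of diabetes
    "否认高血压、糖尿病史。",                  # denies hypertension and diabetes
    "既往体健。",                              # previously healthy
]
rows = [extract_diabetes_history(note) for note in notes]
print(to_csv(rows))
```

Checking the negation rule before the mention rule is a design choice of this sketch: a note that denies diabetes still contains the word, so the order of the rules decides the outcome.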

RESULTS

Among patients in the hospital's Department of Endocrinology with a diagnosis of diabetes, 1,436 were discharged in 2015. Extraction of diabetes history from the medical history sections of their records achieved a recall of 87.6%, a precision (reported as accuracy rate) of 99.5%, and an F-score of 0.93. For a 10% sample of patients with diabetes in the same department (1,223 patients in total), with discharge dates up to August 2017, extraction of diabetes history achieved a recall of 89.2%, a precision of 99.2%, and an F-score of 0.94.
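The reported F-scores can be cross-checked with a short calculation, assuming that the reported "accuracy rate" is precision in the usual information-retrieval sense and that the F-score is the balanced F1 (the harmonic mean of precision and recall). The function name below is illustrative only.

```python
def f_score(precision, recall):
    """Balanced F-score (F1): harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Values taken from the RESULTS paragraph, expressed as fractions.
f_2015 = f_score(precision=0.995, recall=0.876)
f_2017 = f_score(precision=0.992, recall=0.892)

print(round(f_2015, 2))  # 0.93
print(round(f_2017, 2))  # 0.94
```

Under these assumptions both values agree with the figures in the abstract, which supports reading the "accuracy rate" as precision.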

CONCLUSION

This study combines natural language processing with rule-based information extraction to design and implement an algorithm for extracting customized information from unstructured Chinese EMR text. The method achieves better results than existing work.
