Disease Concept-Embedding Based on the Self-Supervised Method for Medical Information Extraction from Electronic Health Records and Disease Retrieval: Algorithm Development and Validation Study.

Author information

Graduate Institute of Biomedical Electronics and Bioinformatics, National Taiwan University, Taipei City, Taiwan.

Department of Emergency Medicine, National Taiwan University BioMedical Park Hospital, Hsinchu County, Taiwan.

Publication information

J Med Internet Res. 2021 Jan 27;23(1):e25113. doi: 10.2196/25113.

DOI:10.2196/25113

PMID:33502324

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC7875703/

Abstract

BACKGROUND

Electronic health records (EHRs) contain a wealth of medical information, and a well-organized EHR can greatly help doctors treat patients. In some cases, doctors must make treatment decisions from only limited patient information. Because EHRs can serve as a reference when information is limited, they can strengthen doctors' treatment capabilities. Natural language processing and deep learning methods can help organize EHR information and translate it into medical knowledge and experience.

OBJECTIVE

In this study, we aimed to create a model to extract concept embeddings from EHRs for disease pattern retrieval and further classification tasks.

METHODS

We collected 1,040,989 emergency department visits from the National Taiwan University Hospital Integrated Medical Database and 305,897 samples from the National Hospital and Ambulatory Medical Care Survey Emergency Department data. After data cleansing and preprocessing, the data sets were divided into training, validation, and test sets. We proposed a Transformer-based model to embed EHRs. Bidirectional Encoder Representations from Transformers (BERT) was used to extract features from free text, and these features were concatenated with structured data as input to the proposed model. Then, Deep InfoMax (DIM) and the simple framework for contrastive learning of visual representations (SimCLR) were used for the unsupervised embedding of the disease concept. The pretrained disease concept-embedding model, named EDisease, was further finetuned to adapt to the critical care outcome prediction task. We assessed the embeddings by visualizing them after dimensionality reduction with t-distributed stochastic neighbor embedding (t-SNE). The performance of the finetuned predictive model was evaluated against published models using the area under the receiver operating characteristic curve (AUROC).
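As an illustration of the contrastive pretraining step, the following is a minimal NumPy sketch of the normalized temperature-scaled cross-entropy (NT-Xent) loss used by SimCLR. It is not the authors' implementation: the function name `nt_xent_loss`, the temperature value, and the random stand-in embeddings are assumptions, and the abstract does not state how the two views of each record are generated.

```python
import numpy as np

def nt_xent_loss(z_a, z_b, temperature=0.5):
    """NT-Xent loss for n paired embeddings (two views of the same record).

    z_a, z_b: arrays of shape (n, d); row i of z_a and row i of z_b form
    a positive pair, and every other row in the batch acts as a negative.
    The temperature default is an assumed value, not taken from the paper.
    """
    n = z_a.shape[0]
    z = np.concatenate([z_a, z_b], axis=0)             # (2n, d)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # dot product = cosine
    sim = (z @ z.T) / temperature                      # (2n, 2n)
    np.fill_diagonal(sim, -np.inf)                     # drop self-similarity

    # The positive for row i is row i + n (first half) or i - n (second half).
    pos = np.concatenate([np.arange(n, 2 * n), np.arange(0, n)])

    # Numerically stable log-softmax over each row.
    row_max = sim.max(axis=1, keepdims=True)
    log_norm = row_max[:, 0] + np.log(np.exp(sim - row_max).sum(axis=1))
    log_prob_pos = sim[np.arange(2 * n), pos] - log_norm
    return float(-log_prob_pos.mean())

# Hypothetical usage with random vectors standing in for record embeddings.
rng = np.random.default_rng(0)
view_a = rng.normal(size=(8, 16))
view_b = view_a + 0.01 * rng.normal(size=(8, 16))      # slightly perturbed view
print(nt_xent_loss(view_a, view_b))
```

Minimizing this loss pulls the two views of one record together and pushes them away from the other records in the batch, which is the property the embedding relies on.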

RESULTS

On the outcome prediction task, our model achieved the highest AUROC, 0.876. In the ablation study, pretraining on a smaller data set or with fewer unsupervised methods degraded prediction performance. The AUROCs were 0.857, 0.870, and 0.868 for the model without pretraining, the model pretrained by only SimCLR, and the model pretrained by only DIM, respectively. On the smaller finetuning set, the AUROC was 0.815 for the proposed model.
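For readers unfamiliar with the metric, the AUROC equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half. The sketch below computes it that way on toy data; the function name `auroc` and the example labels and scores are illustrative assumptions, not values from the study.

```python
import numpy as np

def auroc(y_true, y_score):
    """AUROC via pairwise comparison of positive and negative scores."""
    y_true = np.asarray(y_true, dtype=bool)
    y_score = np.asarray(y_score, dtype=float)
    pos = y_score[y_true]
    neg = y_score[~y_true]
    if pos.size == 0 or neg.size == 0:
        raise ValueError("AUROC needs at least one positive and one negative")
    # Compare every positive score with every negative score.
    diff = pos[:, None] - neg[None, :]
    wins = (diff > 0).sum() + 0.5 * (diff == 0).sum()   # ties count as half
    return float(wins / (pos.size * neg.size))

# Toy example: 3 of the 4 positive/negative pairs are ordered correctly.
print(auroc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

The pairwise form uses memory proportional to the number of positive-negative pairs, so for data sets of the size described here a rank-based or library implementation would be the practical choice.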

CONCLUSIONS

Through contrastive learning methods, disease concepts can be embedded meaningfully. Moreover, these methods can be used for disease retrieval tasks to enhance clinical practice capabilities. The disease concept model is also suitable as a pretrained model for subsequent prediction tasks.
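One common way to use such embeddings for retrieval is nearest-neighbor search by cosine similarity. The abstract does not state which similarity measure the authors used, so the sketch below is an assumption throughout: the function name `retrieve_similar` and the toy two-dimensional vectors are hypothetical.

```python
import numpy as np

def retrieve_similar(query, bank, k=3):
    """Return indices and cosine scores of the k bank rows closest to query."""
    query = np.asarray(query, dtype=float)
    bank = np.asarray(bank, dtype=float)
    q = query / np.linalg.norm(query)
    b = bank / np.linalg.norm(bank, axis=1, keepdims=True)
    scores = b @ q                                  # cosine similarity per row
    top = np.argsort(-scores, kind="stable")[:k]    # highest scores first
    return top, scores[top]

# Toy embedding bank; in practice each row would be one visit's embedding.
bank = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
indices, scores = retrieve_similar([1.0, 0.1], bank, k=2)
print(indices, scores)
```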
