David Chang, Woo Suk Hong, Richard Andrew Taylor
Computational Biology and Bioinformatics Program, Yale University, New Haven, Connecticut, USA.
Department of Emergency Medicine, Yale School of Medicine, New Haven, Connecticut, USA.
JAMIA Open. 2020 Jul 15;3(2):160-166. doi: 10.1093/jamiaopen/ooaa022. eCollection 2020 Jul.
We learn contextual embeddings for emergency department (ED) chief complaints using Bidirectional Encoder Representations from Transformers (BERT), a state-of-the-art language model, to derive a compact and computationally useful representation for free-text chief complaints.
Retrospective data on 2.1 million adult and pediatric ED visits were obtained from a large healthcare system covering the period of March 2013 to July 2019. A total of 355 497 (16.4%) visits from 65 737 (8.9%) patients were removed for absence of either a structured or unstructured chief complaint. To ensure adequate training set size, chief complaint labels that comprised less than 0.01%, or 1 in 10 000, of all visits were excluded. The cutoff threshold was incremented on a log scale to create seven datasets of decreasing sparsity. The classification task was to predict the provider-assigned label from the free-text chief complaint using BERT, with Long Short-Term Memory (LSTM) and Embeddings from Language Models (ELMo) as baselines. Performance was measured as the Top-k accuracy for k = 1 to 5 on a hold-out test set comprising 5% of the samples. The embedding for each free-text chief complaint was extracted as the final 768-dimensional layer of the BERT model and visualized using t-distributed stochastic neighbor embedding (t-SNE).
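Two mechanical steps of the procedure above, the frequency cutoff for rare labels and the Top-k accuracy metric, can be sketched in NumPy. This is a minimal illustration, not the authors' code; the function names and the toy inputs are assumptions.

```python
import numpy as np

def frequent_label_mask(labels, min_fraction=1e-4):
    """Boolean mask keeping visits whose chief-complaint label covers
    at least min_fraction (0.01% = 1e-4) of all visits."""
    labels = np.asarray(labels)
    values, counts = np.unique(labels, return_counts=True)
    keep = set(values[counts / len(labels) >= min_fraction])
    return np.array([lab in keep for lab in labels])

def top_k_accuracy(scores, targets, k=5):
    """Fraction of samples whose true label index is among the k
    highest-scoring classes (scores has shape [n_samples, n_classes])."""
    scores = np.asarray(scores)
    targets = np.asarray(targets)
    top_k = np.argsort(scores, axis=1)[:, -k:]   # indices of the k best classes
    hits = (top_k == targets[:, None]).any(axis=1)
    return hits.mean()
```

Raising `min_fraction` on a log scale (1e-4, 1e-3, ...) yields progressively smaller, denser label sets, mirroring the seven datasets of decreasing sparsity described above.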
The models achieved increasing performance with datasets of decreasing sparsity, with BERT outperforming both LSTM and ELMo. The BERT model yielded Top-1 accuracies of 0.65 and 0.69, Top-3 accuracies of 0.87 and 0.90, and Top-5 accuracies of 0.92 and 0.94 on datasets comprising 434 and 188 labels, respectively. Visualization using t-SNE mapped the learned embeddings in a clinically meaningful way, with related concepts embedded close to each other and broader types of chief complaints clustered together.
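The t-SNE projection of the 768-dimensional embeddings can be reproduced in outline with scikit-learn. The random vectors below are a stand-in for the real final-layer BERT embeddings, which are not published with the abstract.

```python
import numpy as np
from sklearn.manifold import TSNE

# Stand-in data: 200 chief-complaint embeddings, each mimicking the
# 768-dimensional final-layer BERT vector described in the Methods.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(200, 768))

# Project to 2-D for visualization; perplexity must stay below the
# number of samples.
tsne = TSNE(n_components=2, perplexity=30, random_state=0)
coords = tsne.fit_transform(embeddings)  # shape (200, 2)
```

The resulting 2-D coordinates can then be scatter-plotted and colored by provider-assigned label to inspect whether semantically related complaints cluster together.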
Despite the inherent noise in the chief complaint label space, the model was able to learn a rich representation of chief complaints and generate reasonable predictions of their labels. The learned embeddings accurately predict provider-assigned chief complaint labels and map semantically similar chief complaints to nearby points in vector space.
Such a model may be used to automatically map free-text chief complaints to structured fields and to assist the development of a standardized, data-driven ontology of chief complaints for healthcare institutions.