Binary acronym disambiguation in clinical notes from electronic health records with an application in computational phenotyping.

Author information

Link Nicholas B, Huang Sicong, Cai Tianrun, Sun Jiehuan, Dahal Kumar, Costa Lauren, Cho Kelly, Liao Katherine, Cai Tianxi, Hong Chuan

Affiliations

VA Boston Healthcare System, Boston, MA, United States; Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA, United States.

VA Boston Healthcare System, Boston, MA, United States; Division of Rheumatology, Immunology, and Allergy, Brigham and Women's Hospital, Boston, MA, United States.

Publication information

Int J Med Inform. 2022 Apr 1;162:104753. doi: 10.1016/j.ijmedinf.2022.104753.

DOI:10.1016/j.ijmedinf.2022.104753

PMID:35405530

Abstract

OBJECTIVE

The use of electronic health records (EHR) systems has grown over the past decade, and with it, the need to extract information from unstructured clinical narratives. Clinical notes, however, frequently contain acronyms with several potential senses (meanings), and traditional natural language processing (NLP) techniques cannot differentiate between these senses. In this study, we introduce a semi-supervised method for binary acronym disambiguation, the task of classifying whether an acronym in clinical EHR notes carries a target sense.

METHODS

We developed a semi-supervised ensemble machine learning (CASEml) algorithm to automatically identify when an acronym means a target sense by leveraging semantic embeddings, visit-level text and billing information. The algorithm was validated using note data from the Veterans Affairs hospital system to classify the meaning of three acronyms: RA, MS, and MI. We compared the performance of CASEml against another standard semi-supervised method and a baseline metric selecting the most frequent acronym sense. Along with evaluating the performance of these methods for specific instances of acronyms, we evaluated the impact of acronym disambiguation on NLP-driven phenotyping of rheumatoid arthritis.
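The most-frequent-sense baseline mentioned above can be illustrated with a minimal sketch. This is not the authors' code: the function names, the toy mentions, and the labels are all hypothetical, and the example only shows how a baseline that always predicts the majority sense would be scored on accuracy.

```python
from collections import Counter


def most_frequent_sense_baseline(train_labels):
    """Build a classifier that always predicts the majority training label.

    Labels are binary: 1 = target sense, 0 = any other sense.
    """
    majority = Counter(train_labels).most_common(1)[0][0]
    # The baseline ignores the mention text entirely.
    return lambda mentions: [majority for _ in mentions]


def accuracy(predicted, gold):
    """Fraction of mentions whose predicted sense matches the gold label."""
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


# Hypothetical toy data: 1 = "rheumatoid arthritis", 0 = another sense of "RA".
train_labels = [1, 1, 0, 1, 0, 1]
test_mentions = [
    "pt with RA on methotrexate",
    "RA pressure elevated",
    "seropositive RA",
]
test_labels = [1, 0, 1]

predict = most_frequent_sense_baseline(train_labels)
predictions = predict(test_mentions)
baseline_accuracy = accuracy(predictions, test_labels)
```

Because such a baseline never looks at context, its accuracy equals the prevalence of the majority sense in the evaluation set, which is why it serves as a floor for context-aware methods.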

RESULTS

CASEml achieved accuracies of 0.947, 0.911, and 0.706 for RA, MS, and MI, respectively, higher than a standard baseline metric and (on average) higher than a state-of-the-art semi-supervised method. We also demonstrated that applying CASEml to medical notes improves the AUC of a phenotyping algorithm for rheumatoid arthritis.

CONCLUSION

CASEml is a novel method that accurately disambiguates acronyms in clinical notes and has advantages over commonly used supervised and semi-supervised machine learning approaches. In addition, CASEml improves the performance of NLP tasks that rely on ambiguous acronyms, such as phenotyping.
