Link Nicholas B, Huang Sicong, Cai Tianrun, Sun Jiehuan, Dahal Kumar, Costa Lauren, Cho Kelly, Liao Katherine, Cai Tianxi, Hong Chuan
VA Boston Healthcare System, Boston, MA, United States; Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA, United States.
VA Boston Healthcare System, Boston, MA, United States; Division of Rheumatology, Immunology, and Allergy, Brigham and Women's Hospital, Boston, MA, United States.
Int J Med Inform. 2022 Apr 1;162:104753. doi: 10.1016/j.ijmedinf.2022.104753.
The use of electronic health record (EHR) systems has grown over the past decade, and with it, the need to extract information from unstructured clinical narratives. Clinical notes, however, frequently contain acronyms with several potential senses (meanings), and traditional natural language processing (NLP) techniques cannot differentiate between these senses. In this study, we introduce a semi-supervised method for binary acronym disambiguation, the task of classifying whether an acronym in a clinical EHR note carries a target sense.
We developed a semi-supervised ensemble machine learning (CASEml) algorithm to automatically identify when an acronym means a target sense by leveraging semantic embeddings, visit-level text and billing information. The algorithm was validated using note data from the Veterans Affairs hospital system to classify the meaning of three acronyms: RA, MS, and MI. We compared the performance of CASEml against another standard semi-supervised method and a baseline metric selecting the most frequent acronym sense. Along with evaluating the performance of these methods for specific instances of acronyms, we evaluated the impact of acronym disambiguation on NLP-driven phenotyping of rheumatoid arthritis.
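The most-frequent-sense baseline mentioned above can be illustrated with a minimal sketch. This is not the authors' code: the function names and the toy labels are hypothetical, and the sketch only assumes that each acronym mention carries a binary label (1 for the target sense, 0 otherwise).

```python
from collections import Counter


def most_frequent_sense_baseline(train_labels, n_test):
    """Predict the most common training label for every test mention."""
    majority_label, _ = Counter(train_labels).most_common(1)[0]
    return [majority_label] * n_test


def accuracy(predicted, actual):
    """Fraction of mentions whose predicted sense matches the true sense."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


# Hypothetical toy labels: 1 = target sense (e.g. "RA" meaning
# rheumatoid arthritis), 0 = any other sense.
train_labels = [1, 1, 0, 1, 0, 1]
test_labels = [1, 0, 1, 1]

predictions = most_frequent_sense_baseline(train_labels, len(test_labels))
baseline_accuracy = accuracy(predictions, test_labels)
print(predictions, baseline_accuracy)
```

Because this baseline ignores the note context entirely, its accuracy equals the share of test mentions that carry the majority sense, which is why it serves as a floor for context-aware methods.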
CASEml achieved accuracies of 0.947, 0.911, and 0.706 for RA, MS, and MI, respectively, higher than a standard baseline metric and (on average) higher than a state-of-the-art semi-supervised method. We also demonstrated that applying CASEml to medical notes improves the AUC of a phenotype algorithm for rheumatoid arthritis.
CASEml is a novel method that accurately disambiguates acronyms in clinical notes and has advantages over commonly used supervised and semi-supervised machine learning approaches. In addition, CASEml improves the performance of NLP tasks that rely on ambiguous acronyms, such as phenotyping.