Parikh Soham, Davoudi Anahita, Yu Shun, Giraldo Carolina, Schriver Emily, Mowery Danielle
School of Engineering and Applied Science, University of Pennsylvania, Philadelphia, PA, United States.
Department of Biostatistics, Epidemiology, & Informatics, University of Pennsylvania, Philadelphia, PA, United States.
JMIR Med Inform. 2021 Feb 22;9(2):e21679. doi: 10.2196/21679.
Scientists are developing new computational methods and prediction models to improve the clinical understanding of COVID-19 prevalence, treatment efficacy, and patient outcomes. These efforts could be improved by leveraging documented COVID-19-related symptoms, findings, and disorders from clinical text sources in an electronic health record. Word embeddings can identify terms related to these clinical concepts from both the biomedical and nonbiomedical domains, and are being shared with the open-source community at large. However, it is unclear how useful openly available word embeddings are for developing lexicons for COVID-19-related concepts.
Given an initial lexicon of COVID-19-related terms, this study aims to characterize the returned terms by similarity across various open-source word embeddings and determine common semantic and syntactic patterns between the COVID-19 queried terms and returned terms specific to the word embedding source.
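As a rough illustration of what "returned terms by similarity" means, the sketch below ranks vocabulary terms by cosine similarity to a queried term. It assumes cosine similarity as the measure, which is a common choice for word embeddings but is not stated in this abstract. The vectors are made-up 3-dimensional toy values, and the names `cosine`, `most_similar`, and `toy_embeddings` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, nonzero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(query, embeddings, top_n=3):
    """Return the top_n (term, similarity) pairs closest to the query term."""
    query_vec = embeddings[query]
    scored = [
        (term, cosine(query_vec, vec))
        for term, vec in embeddings.items()
        if term != query  # do not return the queried term itself
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


# Toy vectors invented for illustration only; real embeddings have
# hundreds of dimensions and are loaded from a trained model.
toy_embeddings = {
    "cough": [0.9, 0.1, 0.0],
    "coughing": [0.85, 0.15, 0.05],
    "fever": [0.1, 0.9, 0.1],
    "pyrexia": [0.15, 0.85, 0.1],
    "headache": [0.3, 0.3, 0.8],
}

for term, score in most_similar("fever", toy_embeddings, top_n=2):
    print(f"{term}\t{score:.3f}")
```

With real embedding sources, the same query can return different neighbors depending on how each source was trained, which is the variation the study sets out to characterize.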
We compared seven openly available word embedding sources. Using a series of COVID-19-related terms for associated symptoms, findings, and disorders, we conducted an interannotator agreement study to determine how accurately the most similar returned terms could be classified according to semantic types by three annotators. We conducted a qualitative study of COVID-19 queried terms and their returned terms to detect informative patterns for constructing lexicons. We demonstrated the utility of applying such learned synonyms to discharge summaries by reporting the proportion of patients identified by concept among three patient cohorts: pneumonia (n=6410), acute respiratory distress syndrome (n=8647), and COVID-19 (n=2397).
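The pairwise agreement statistic used in the interannotator study can be sketched as follows. This is a minimal implementation of the standard Cohen kappa formula, (observed agreement minus chance agreement) divided by (1 minus chance agreement), applied to every pair of annotators. The annotator names and labels are invented toy data, not the study's annotations, and the function names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cohen_kappa(labels_a, labels_b):
    """Cohen kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and equal in length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's marginal label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


def pairwise_kappa(annotations):
    """Kappa for every pair of annotators in a name -> labels mapping."""
    return {
        (a, b): cohen_kappa(annotations[a], annotations[b])
        for a, b in combinations(sorted(annotations), 2)
    }


# Toy labels invented for illustration only.
toy_annotations = {
    "annotator_1": ["symptom", "symptom", "finding", "disorder", "symptom", "finding"],
    "annotator_2": ["symptom", "symptom", "finding", "disorder", "finding", "finding"],
    "annotator_3": ["symptom", "symptom", "finding", "disorder", "symptom", "finding"],
}

for pair, kappa in pairwise_kappa(toy_annotations).items():
    print(pair, round(kappa, 2))
```

Three annotators yield three pairs, which is why agreement is reported as a range per semantic type rather than as a single value.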
We observed high pairwise interannotator agreement (Cohen kappa) for symptoms (0.86-0.99), findings (0.93-0.99), and disorders (0.93-0.99). Word embedding sources generated based on characters tend to return more synonyms (mean count of 7.2 synonyms) compared to token-based embedding sources (mean counts range from 2.0 to 3.4). Word embedding sources queried using a qualifier term (eg, dry cough or muscle pain) more often returned qualifiers of a similar semantic type (eg, "dry" returns consistency qualifiers like "wet" and "runny") compared to single-term queries (eg, cough or pain). A higher proportion of patients had documented fever (0.61-0.84), cough (0.41-0.55), shortness of breath (0.40-0.59), and hypoxia (0.51-0.56) retrieved than other clinical features. Terms for dry cough returned a higher proportion of patients with COVID-19 (0.07) than the pneumonia (0.05) and acute respiratory distress syndrome (0.03) populations.
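The per-concept patient proportions can be understood as the share of patients in a cohort whose notes mention at least one lexicon term for that concept. The sketch below assumes simple case-insensitive, whole-word matching and ignores negation and other context, which a real pipeline would likely need to handle. The lexicon, notes, and function names are hypothetical toy examples, not the study's data or method.

```python
import re


def compile_lexicon(lexicon):
    """Build one case-insensitive whole-word pattern per concept."""
    compiled = {}
    for concept, terms in lexicon.items():
        # Longer terms first so multiword terms are tried before substrings.
        ordered = sorted(terms, key=len, reverse=True)
        alternatives = "|".join(re.escape(term) for term in ordered)
        compiled[concept] = re.compile(r"\b(?:" + alternatives + r")\b", re.IGNORECASE)
    return compiled


def proportion_by_concept(notes_by_patient, lexicon):
    """Fraction of patients with at least one note matching each concept."""
    patterns = compile_lexicon(lexicon)
    total = len(notes_by_patient)
    proportions = {}
    for concept, pattern in patterns.items():
        hits = sum(
            1
            for notes in notes_by_patient.values()
            if any(pattern.search(note) for note in notes)
        )
        proportions[concept] = hits / total if total else 0.0
    return proportions


# Toy lexicon and notes invented for illustration only.
toy_lexicon = {
    "fever": ["fever", "febrile", "pyrexia"],
    "dry cough": ["dry cough", "nonproductive cough"],
}
toy_notes = {
    "p1": ["Patient febrile on admission.", "Reports dry cough."],
    "p2": ["No complaints."],
    "p3": ["Fever resolved."],
    "p4": ["Nonproductive cough for 3 days."],
}

print(proportion_by_concept(toy_notes, toy_lexicon))
```

Adding synonyms learned from word embeddings to a concept's term list widens the pattern, so more patients can be matched for that concept than with the single seed term alone.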
Word embeddings are a valuable technology for learning related terms, including synonyms. When leveraging openly available word embedding sources, choices made for the construction of the word embeddings can significantly influence the words learned.