Department of Health Sciences Research, Mayo Clinic, Rochester, USA.
J Biomed Inform. 2018 Nov;87:12-20. doi: 10.1016/j.jbi.2018.09.008. Epub 2018 Sep 12.
BACKGROUND: Word embeddings have been widely used in biomedical Natural Language Processing (NLP) applications because the vector representations can capture useful semantic properties and linguistic relationships between words. Different textual resources (e.g., Wikipedia and biomedical literature corpora) have been used in biomedical NLP to train word embeddings, and these embeddings are commonly fed as features to downstream machine learning models. However, there has been little work on evaluating word embeddings trained from different textual resources.

METHODS: In this study, we empirically evaluated word embeddings trained from four different corpora: clinical notes, biomedical publications, Wikipedia, and news. For the first two resources, we trained word embeddings on unstructured electronic health record (EHR) data available at Mayo Clinic and on articles (MedLit) from PubMed Central, respectively. For the latter two, we used publicly available pre-trained word embeddings, GloVe and Google News. The evaluation was both qualitative and quantitative. For the qualitative evaluation, we randomly selected medical terms from three categories (disorder, symptom, and drug) and manually inspected the five most similar words computed by each set of embeddings for each term. We also analyzed the word embeddings through a 2-dimensional visualization plot of 377 medical terms. For the quantitative evaluation, we conducted both intrinsic and extrinsic evaluations. For the intrinsic evaluation, we assessed the word embeddings' ability to capture medical semantics by measuring the semantic similarity between medical terms on four published datasets: Pedersen's dataset, Hliaoutakis's dataset, MayoSRS, and UMNSRS. For the extrinsic evaluation, we applied the word embeddings to multiple downstream biomedical NLP applications, including clinical information extraction (IE), biomedical information retrieval (IR), and relation extraction (RE), using data from shared tasks.

RESULTS: The qualitative evaluation shows that the word embeddings trained from EHR and MedLit find more similar medical terms than those trained from GloVe and Google News. The intrinsic quantitative evaluation verifies that the semantic similarity captured by the word embeddings trained from EHR is closest to human experts' judgments on all four tested datasets. The extrinsic quantitative evaluation shows that the word embeddings trained on EHR achieved the best F1 score of 0.900 for the clinical IE task; no word embeddings improved performance on the biomedical IR task; and the word embeddings trained on Google News achieved the best overall F1 score of 0.790 for the RE task.

CONCLUSION: Based on the evaluation results, we draw the following conclusions. First, the word embeddings trained from EHR and MedLit capture the semantics of medical terms better, and find semantically relevant medical terms closer to human experts' judgments, than those trained from GloVe and Google News. Second, there is no consistent global ranking of word embeddings across all downstream biomedical NLP applications; however, adding word embeddings as extra features improves results on most downstream tasks. Finally, word embeddings trained from biomedical-domain corpora do not necessarily outperform those trained from general-domain corpora on every downstream biomedical NLP task.
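The nearest-neighbor inspection and the intrinsic similarity comparison described above are straightforward to reproduce in outline. Below is a minimal Python sketch, assuming gensim's word2vec implementation and Spearman correlation as the agreement measure with human ratings (the abstract does not state which statistic the authors used); the training sentences and the rated term pairs are hypothetical placeholders, not the paper's actual data or hyperparameters.

# A minimal sketch, assuming gensim and SciPy; the corpus and the
# human-rated term pairs below are toy placeholders, not the study's data.
from gensim.models import Word2Vec
from scipy.stats import spearmanr

# Train embeddings on a tokenized corpus (the study trained separate models
# on Mayo Clinic EHR notes and on PubMed Central articles).
sentences = [
    ["patient", "reports", "chest", "pain", "and", "dyspnea"],
    ["metformin", "was", "prescribed", "for", "type", "2", "diabetes"],
    # ... a real corpus would contain millions of sentences
]
model = Word2Vec(sentences, vector_size=200, window=5, min_count=1, workers=4)

# Qualitative evaluation: inspect the five most similar words for a term.
print(model.wv.most_similar("dyspnea", topn=5))

# Intrinsic evaluation: correlate embedding similarity with human ratings.
# rated_pairs stands in for datasets such as MayoSRS or UMNSRS:
# (term 1, term 2, mean human similarity rating).
rated_pairs = [
    ("pain", "dyspnea", 2.1),
    ("chest", "pain", 1.8),
    ("metformin", "diabetes", 3.4),
]
model_scores = [model.wv.similarity(a, b) for a, b, _ in rated_pairs]
human_scores = [r for _, _, r in rated_pairs]
rho, _ = spearmanr(model_scores, human_scores)
print(f"Spearman correlation with human judgments: {rho:.3f}")

In practice, one such model would be trained per corpus, and the correlation computed per benchmark dataset, so the four embedding sources can be ranked by their agreement with human judgments.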