
A comparison of word embeddings for the biomedical natural language processing.

Affiliation

Department of Health Sciences Research, Mayo Clinic, Rochester, USA.

Publication

J Biomed Inform. 2018 Nov;87:12-20. doi: 10.1016/j.jbi.2018.09.008. Epub 2018 Sep 12.

DOI: 10.1016/j.jbi.2018.09.008
PMID: 30217670
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC6585427/
Abstract

BACKGROUND

Word embeddings have been widely used in biomedical Natural Language Processing (NLP) applications because the vector representations can capture useful semantic properties and linguistic relationships between words. Different textual resources (e.g., Wikipedia and biomedical literature corpora) have been used to train word embeddings for biomedical NLP, and these embeddings are commonly fed as feature input to downstream machine learning models. However, there has been little work on evaluating word embeddings trained from different textual resources.
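As background for how embeddings are "leveraged as feature input," a common baseline is to average a text's token vectors into one fixed-length feature vector for a downstream classifier. A minimal sketch with invented 3-dimensional vectors (real embeddings such as word2vec or GloVe are typically 100-300 dimensional):

```python
# Hypothetical illustration only: the vectors below are invented toy values,
# not embeddings from any corpus used in the study.
TOY_EMBEDDINGS = {
    "chest":   [0.9, 0.1, 0.0],
    "pain":    [0.7, 0.3, 0.2],
    "aspirin": [0.1, 0.9, 0.3],
}

def sentence_features(tokens, embeddings, dim=3):
    """Average the vectors of in-vocabulary tokens; zeros if none are known."""
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    if not vecs:
        return [0.0] * dim
    # zip(*vecs) iterates over dimensions (columns) of the token vectors.
    return [sum(col) / len(vecs) for col in zip(*vecs)]

features = sentence_features(["chest", "pain"], TOY_EMBEDDINGS)
```

The resulting fixed-length vector can be concatenated with other features before being passed to any standard classifier.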

METHODS

In this study, we empirically evaluated word embeddings trained from four different corpora: clinical notes, biomedical publications, Wikipedia, and news. For the first two resources, we trained word embeddings on unstructured electronic health record (EHR) data available at Mayo Clinic and on articles (MedLit) from PubMed Central, respectively. For the latter two, we used the publicly available pre-trained word embeddings GloVe and Google News. The evaluation was done both qualitatively and quantitatively. For the qualitative evaluation, we randomly selected medical terms from three categories (disorder, symptom, and drug) and manually inspected the five most similar words computed by the embeddings for each term. We also analyzed the word embeddings through a 2-dimensional visualization plot of 377 medical terms. For the quantitative evaluation, we conducted both intrinsic and extrinsic evaluations. The intrinsic evaluation assessed the word embeddings' ability to capture medical semantics by measuring the semantic similarity between medical terms on four published datasets: Pedersen's dataset, Hliaoutakis's dataset, MayoSRS, and UMNSRS. The extrinsic evaluation applied the word embeddings to multiple downstream biomedical NLP applications, including clinical information extraction (IE), biomedical information retrieval (IR), and relation extraction (RE), using data from shared tasks.
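For the qualitative evaluation described above, the five most similar words per term are found by ranking the vocabulary by cosine similarity to the query vector. A minimal sketch with invented toy vectors (a real evaluation would load embeddings trained on the EHR, MedLit, GloVe, or Google News corpora):

```python
import math

# Invented toy vectors for illustration only.
TOY_VECS = {
    "diabetes":      [0.9, 0.2, 0.1],
    "hyperglycemia": [0.8, 0.3, 0.1],
    "insulin":       [0.7, 0.5, 0.2],
    "football":      [0.0, 0.1, 0.9],
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def most_similar(term, embeddings, k=5):
    """Return the k terms whose vectors are closest to `term` by cosine similarity."""
    query = embeddings[term]
    others = [(t, cosine(query, v)) for t, v in embeddings.items() if t != term]
    return sorted(others, key=lambda x: x[1], reverse=True)[:k]

neighbors = most_similar("diabetes", TOY_VECS, k=3)  # hyperglycemia ranks first
```

With these toy vectors, the medically related terms rank above the unrelated one, which is the kind of behavior the manual inspection in the study checks for.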

RESULTS

The qualitative evaluation shows that the word embeddings trained from EHR and MedLit can find more similar medical terms than those trained from GloVe and Google News. The intrinsic quantitative evaluation verifies that the semantic similarity captured by the word embeddings trained from EHR is closer to human experts' judgments on all four tested datasets. The extrinsic quantitative evaluation shows that the word embeddings trained on EHR achieved the best F1 score of 0.900 for the clinical IE task; no word embeddings improved the performance for the biomedical IR task; and the word embeddings trained on Google News had the best overall F1 score of 0.790 for the RE task.
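The intrinsic result above, that similarities from the EHR embeddings sit closest to human experts' judgments, is typically quantified with the Spearman rank correlation between model similarities and human ratings of the same term pairs. A self-contained sketch with invented scores (not data from the study), assuming no tied values:

```python
def spearman(xs, ys):
    """Spearman rank correlation; assumes no ties in either list."""
    def ranks(vals):
        order = sorted(range(len(vals)), key=lambda i: vals[i])
        r = [0] * len(vals)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Invented per-pair scores: embedding cosine similarity vs. human rating.
model_sims = [0.95, 0.40, 0.75, 0.10]
human_scores = [9.0, 3.5, 7.0, 1.0]
rho = spearman(model_sims, human_scores)  # identical rank order -> 1.0
```

A correlation near 1 means the embedding orders term pairs the same way the experts do; the study compares this statistic across the four embedding sources on each similarity dataset.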

CONCLUSION

Based on the evaluation results, we can draw the following conclusions. First, the word embeddings trained from EHR and MedLit capture the semantics of medical terms better, and find semantically relevant medical terms closer to human experts' judgments, than those trained from GloVe and Google News. Second, there is no consistent global ranking of word embeddings across all downstream biomedical NLP applications; however, adding word embeddings as extra features improves results on most downstream tasks. Finally, word embeddings trained from biomedical domain corpora do not necessarily perform better than those trained from general domain corpora on every downstream biomedical NLP task.


Similar Articles

1. A comparison of word embeddings for the biomedical natural language processing.
   J Biomed Inform. 2018 Nov;87:12-20. doi: 10.1016/j.jbi.2018.09.008. Epub 2018 Sep 12.
2. Evaluating semantic relations in neural word embeddings with biomedical and general domain knowledge bases.
   BMC Med Inform Decis Mak. 2018 Jul 23;18(Suppl 2):65. doi: 10.1186/s12911-018-0630-x.
3. The Impact of Specialized Corpora for Word Embeddings in Natural Langage Understanding.
   Stud Health Technol Inform. 2020 Jun 16;270:432-436. doi: 10.3233/SHTI200197.
4. HPO2Vec+: Leveraging heterogeneous knowledge resources to enrich node embeddings for the Human Phenotype Ontology.
   J Biomed Inform. 2019 Aug;96:103246. doi: 10.1016/j.jbi.2019.103246. Epub 2019 Jun 27.
5. Improved biomedical word embeddings in the transformer era.
   J Biomed Inform. 2021 Aug;120:103867. doi: 10.1016/j.jbi.2021.103867. Epub 2021 Jul 18.
6. BioConceptVec: Creating and evaluating literature-based biomedical concept embeddings on a large scale.
   PLoS Comput Biol. 2020 Apr 23;16(4):e1007617. doi: 10.1371/journal.pcbi.1007617. eCollection 2020 Apr.
7. Deep learning with sentence embeddings pre-trained on biomedical corpora improves the performance of finding similar sentences in electronic medical records.
   BMC Med Inform Decis Mak. 2020 Apr 30;20(Suppl 1):73. doi: 10.1186/s12911-020-1044-0.
8. Bio-SimVerb and Bio-SimLex: wide-coverage evaluation sets of word similarity in biomedicine.
   BMC Bioinformatics. 2018 Feb 5;19(1):33. doi: 10.1186/s12859-018-2039-z.
9. Domain specific word embeddings for natural language processing in radiology.
   J Biomed Inform. 2021 Jan;113:103665. doi: 10.1016/j.jbi.2020.103665. Epub 2020 Dec 15.
10. A study of deep learning methods for de-identification of clinical notes in cross-institute settings.
    BMC Med Inform Decis Mak. 2019 Dec 5;19(Suppl 5):232. doi: 10.1186/s12911-019-0935-4.

Cited By

1. Using semantic search to find publicly available gene-expression datasets.
   bioRxiv. 2025 Mar 15:2025.03.13.643153. doi: 10.1101/2025.03.13.643153.
2. GraphDeep-hERG: Graph Neural Network PharmacoAnalytics for Assessing hERG-Related Cardiotoxicity.
   Pharm Res. 2025 Apr;42(4):579-591. doi: 10.1007/s11095-025-03848-w. Epub 2025 Mar 26.
3. Utility of word embeddings from large language models in medical diagnosis.
   J Am Med Inform Assoc. 2025 Mar 1;32(3):526-534. doi: 10.1093/jamia/ocae314.
4. Semantic matching in GUI test reuse.
   Empir Softw Eng. 2024;29(3):70. doi: 10.1007/s10664-023-10406-8. Epub 2024 May 9.
5. AI-Based Knowledge Extraction from the Bioprinting Literature for Identifying Technology Trends.
   3D Print Addit Manuf. 2024 Aug 20;11(4):1495-1509. doi: 10.1089/3dp.2022.0316. eCollection 2024 Aug.
6. Text summarization for pharmaceutical sciences using hierarchical clustering with a weighted evaluation methodology.
   Sci Rep. 2024 Aug 30;14(1):20149. doi: 10.1038/s41598-024-70618-w.
7. Exploring the performance and explainability of fine-tuned BERT models for neuroradiology protocol assignment.
   BMC Med Inform Decis Mak. 2024 Feb 7;24(1):40. doi: 10.1186/s12911-024-02444-z.
8. The Coming of Age of AI/ML in Drug Discovery, Development, Clinical Testing, and Manufacturing: The FDA Perspectives.
   Drug Des Devel Ther. 2023 Sep 6;17:2691-2725. doi: 10.2147/DDDT.S424991. eCollection 2023.
9. Quality of word and concept embeddings in targetted biomedical domains.
   Heliyon. 2023 Jun 2;9(6):e16818. doi: 10.1016/j.heliyon.2023.e16818. eCollection 2023 Jun.
10. Can Patients with Dementia Be Identified in Primary Care Electronic Medical Records Using Natural Language Processing?
    J Healthc Inform Res. 2023 Jan 23;7(1):42-58. doi: 10.1007/s41666-023-00125-6. eCollection 2023 Mar.

References

1. Privacy-Preserving Predictive Modeling: Harmonization of Contextual Embeddings From Different Sources.
   JMIR Med Inform. 2018 May 16;6(2):e33. doi: 10.2196/medinform.9455.
2. Clinical information extraction applications: A literature review.
   J Biomed Inform. 2018 Jan;77:34-49. doi: 10.1016/j.jbi.2017.11.011. Epub 2017 Nov 21.
3. Predicate Oriented Pattern Analysis for Biomedical Knowledge Discovery.
   Intell Inf Manag. 2016 May;8(3):66-85. doi: 10.4236/iim.2016.83006.
4. Systematic Analysis of Free-Text Family History in Electronic Health Record.
   AMIA Jt Summits Transl Sci Proc. 2017 Jul 26;2017:104-113. eCollection 2017.
5. Knowledge Discovery from Biomedical Ontologies in Cross Domains.
   PLoS One. 2016 Aug 22;11(8):e0160005. doi: 10.1371/journal.pone.0160005. eCollection 2016.
6. Corpus domain effects on distributional semantic modeling of medical terms.
   Bioinformatics. 2016 Dec 1;32(23):3635-3644. doi: 10.1093/bioinformatics/btw529. Epub 2016 Aug 16.
7. MIMIC-III, a freely accessible critical care database.
   Sci Data. 2016 May 24;3:160035. doi: 10.1038/sdata.2016.35.
8. Drug-Drug Interaction Extraction via Convolutional Neural Networks.
   Comput Math Methods Med. 2016;2016:6918381. doi: 10.1155/2016/6918381. Epub 2016 Jan 31.
9. Evaluating word representation features in biomedical named entity recognition tasks.
   Biomed Res Int. 2014;2014:240403. doi: 10.1155/2014/240403. Epub 2014 Mar 6.
10. Semantic Similarity and Relatedness between Clinical Terms: An Experimental Study.
    AMIA Annu Symp Proc. 2010 Nov 13;2010:572-6.