A comparison of word embeddings for the biomedical natural language processing.

Affiliations

Department of Health Sciences Research, Mayo Clinic, Rochester, USA.

Publication Information

J Biomed Inform. 2018 Nov;87:12-20. doi: 10.1016/j.jbi.2018.09.008. Epub 2018 Sep 12.


DOI: 10.1016/j.jbi.2018.09.008
PMID: 30217670
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC6585427/
Abstract

BACKGROUND: Word embeddings have been prevalently used in biomedical Natural Language Processing (NLP) applications due to the ability of the vector representations to capture useful semantic properties and linguistic relationships between words. Different textual resources (e.g., Wikipedia and biomedical literature corpus) have been utilized in biomedical NLP to train word embeddings and these word embeddings have been commonly leveraged as feature input to downstream machine learning models. However, there has been little work on evaluating the word embeddings trained from different textual resources. METHODS: In this study, we empirically evaluated word embeddings trained from four different corpora, namely clinical notes, biomedical publications, Wikipedia, and news. For the former two resources, we trained word embeddings using unstructured electronic health record (EHR) data available at Mayo Clinic and articles (MedLit) from PubMed Central, respectively. For the latter two resources, we used publicly available pre-trained word embeddings, GloVe and Google News. The evaluation was done qualitatively and quantitatively. For the qualitative evaluation, we randomly selected medical terms from three categories (i.e., disorder, symptom, and drug), and manually inspected the five most similar words computed by embeddings for each term. We also analyzed the word embeddings through a 2-dimensional visualization plot of 377 medical terms. For the quantitative evaluation, we conducted both intrinsic and extrinsic evaluation. For the intrinsic evaluation, we evaluated the word embeddings' ability to capture medical semantics by measuring the semantic similarity between medical terms using four published datasets: Pedersen's dataset, Hliaoutakis's dataset, MayoSRS, and UMNSRS.
For the extrinsic evaluation, we applied word embeddings to multiple downstream biomedical NLP applications, including clinical information extraction (IE), biomedical information retrieval (IR), and relation extraction (RE), with data from shared tasks. RESULTS: The qualitative evaluation shows that the word embeddings trained from EHR and MedLit can find more similar medical terms than those trained from GloVe and Google News. The intrinsic quantitative evaluation verifies that the semantic similarity captured by the word embeddings trained from EHR is closer to human experts' judgments on all four tested datasets. The extrinsic quantitative evaluation shows that the word embeddings trained on EHR achieved the best F1 score of 0.900 for the clinical IE task; no word embeddings improved the performance for the biomedical IR task; and the word embeddings trained on Google News had the best overall F1 score of 0.790 for the RE task. CONCLUSION: Based on the evaluation results, we can draw the following conclusions. First, the word embeddings trained from EHR and MedLit can capture the semantics of medical terms better, and find semantically relevant medical terms closer to human experts' judgments than those trained from GloVe and Google News. Second, there does not exist a consistent global ranking of word embeddings for all downstream biomedical NLP applications. However, adding word embeddings as extra features will improve results on most downstream tasks. Finally, the word embeddings trained from the biomedical domain corpora do not necessarily have better performance than those trained from the general domain corpora for any downstream biomedical NLP task.
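The intrinsic evaluation described above can be sketched as follows: score medical term pairs by the cosine similarity of their embedding vectors, then rank-correlate those scores with human expert similarity ratings, as is done with datasets such as MayoSRS and UMNSRS. This is a minimal illustration, not the paper's code; all vectors and ratings below are toy values, and real embeddings would have hundreds of dimensions.

```python
# Minimal sketch of intrinsic embedding evaluation: compare model-derived
# term-pair similarities against human judgments via rank correlation.
# All embeddings and ratings here are hypothetical illustrative values.
import numpy as np
from scipy.stats import spearmanr

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy 4-dimensional word embeddings for six medical terms.
embeddings = {
    "diabetes":      np.array([0.8, 0.1, 0.3, 0.2]),
    "hyperglycemia": np.array([0.7, 0.2, 0.4, 0.1]),
    "aspirin":       np.array([0.1, 0.9, 0.2, 0.3]),
    "ibuprofen":     np.array([0.2, 0.7, 0.1, 0.5]),
    "fracture":      np.array([0.3, 0.2, 0.9, 0.1]),
    "headache":      np.array([0.4, 0.3, 0.2, 0.8]),
}

# Term pairs with toy "human expert" similarity ratings (higher = more similar).
pairs = [("diabetes", "hyperglycemia"), ("aspirin", "ibuprofen"), ("fracture", "headache")]
human_ratings = [3.8, 3.5, 1.2]

# Model similarity scores, then Spearman rank correlation with the ratings.
model_scores = [cosine(embeddings[a], embeddings[b]) for a, b in pairs]
rho, _ = spearmanr(model_scores, human_ratings)
print(f"Spearman correlation with human ratings: {rho:.2f}")
```

A higher rank correlation means the embedding space orders term pairs more like the human experts do; by this criterion the paper found EHR-trained embeddings closest to expert judgments on all four datasets.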

Similar Articles

[1]
A comparison of word embeddings for the biomedical natural language processing.

J Biomed Inform. 2018-9-12

[2]
Evaluating semantic relations in neural word embeddings with biomedical and general domain knowledge bases.

BMC Med Inform Decis Mak. 2018-7-23

[3]
The Impact of Specialized Corpora for Word Embeddings in Natural Langage Understanding.

Stud Health Technol Inform. 2020-6-16

[4]
HPO2Vec+: Leveraging heterogeneous knowledge resources to enrich node embeddings for the Human Phenotype Ontology.

J Biomed Inform. 2019-6-27

[5]
Improved biomedical word embeddings in the transformer era.

J Biomed Inform. 2021-8

[6]
BioConceptVec: Creating and evaluating literature-based biomedical concept embeddings on a large scale.

PLoS Comput Biol. 2020-4-23

[7]
Deep learning with sentence embeddings pre-trained on biomedical corpora improves the performance of finding similar sentences in electronic medical records.

BMC Med Inform Decis Mak. 2020-4-30

[8]
Bio-SimVerb and Bio-SimLex: wide-coverage evaluation sets of word similarity in biomedicine.

BMC Bioinformatics. 2018-2-5

[9]
Domain specific word embeddings for natural language processing in radiology.

J Biomed Inform. 2021-1

[10]
A study of deep learning methods for de-identification of clinical notes in cross-institute settings.

BMC Med Inform Decis Mak. 2019-12-5

Cited By

[1]
Using semantic search to find publicly available gene-expression datasets.

bioRxiv. 2025-3-15

[2]
GraphDeep-hERG: Graph Neural Network PharmacoAnalytics for Assessing hERG-Related Cardiotoxicity.

Pharm Res. 2025-4

[3]
Utility of word embeddings from large language models in medical diagnosis.

J Am Med Inform Assoc. 2025-3-1

[4]
Semantic matching in GUI test reuse.

Empir Softw Eng. 2024

[5]
AI-Based Knowledge Extraction from the Bioprinting Literature for Identifying Technology Trends.

3D Print Addit Manuf. 2024-8-20

[6]
Text summarization for pharmaceutical sciences using hierarchical clustering with a weighted evaluation methodology.

Sci Rep. 2024-8-30

[7]
Exploring the performance and explainability of fine-tuned BERT models for neuroradiology protocol assignment.

BMC Med Inform Decis Mak. 2024-2-7

[8]
The Coming of Age of AI/ML in Drug Discovery, Development, Clinical Testing, and Manufacturing: The FDA Perspectives.

Drug Des Devel Ther. 2023

[9]
Quality of word and concept embeddings in targetted biomedical domains.

Heliyon. 2023-6-2

[10]
Can Patients with Dementia Be Identified in Primary Care Electronic Medical Records Using Natural Language Processing?

J Healthc Inform Res. 2023-1-23

References

[1]
Privacy-Preserving Predictive Modeling: Harmonization of Contextual Embeddings From Different Sources.

JMIR Med Inform. 2018-5-16

[2]
Clinical information extraction applications: A literature review.

J Biomed Inform. 2017-11-21

[3]
Predicate Oriented Pattern Analysis for Biomedical Knowledge Discovery.

Intell Inf Manag. 2016-5

[4]
Systematic Analysis of Free-Text Family History in Electronic Health Record.

AMIA Jt Summits Transl Sci Proc. 2017-7-26

[5]
Knowledge Discovery from Biomedical Ontologies in Cross Domains.

PLoS One. 2016-8-22

[6]
Corpus domain effects on distributional semantic modeling of medical terms.

Bioinformatics. 2016-12-1

[7]
MIMIC-III, a freely accessible critical care database.

Sci Data. 2016-5-24

[8]
Drug-Drug Interaction Extraction via Convolutional Neural Networks.

Comput Math Methods Med. 2016

[9]
Evaluating word representation features in biomedical named entity recognition tasks.

Biomed Res Int. 2014-3-6

[10]
Semantic Similarity and Relatedness between Clinical Terms: An Experimental Study.

AMIA Annu Symp Proc. 2010-11-13
