BioWordVec, improving biomedical word embeddings with subword information and MeSH.

Affiliations

National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, Maryland, 20894, USA.

School of Computer Science and Technology, Dalian University of Technology, Dalian, Liaoning, 116023, China.

Publication information

Sci Data. 2019 May 10;6(1):52. doi: 10.1038/s41597-019-0055-0.

DOI: 10.1038/s41597-019-0055-0

PMID: 31076572

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC6510737/

Abstract

Distributed word representations have become an essential foundation for biomedical natural language processing (BioNLP), text mining and information retrieval. Word embeddings are traditionally computed at the word level from a large corpus of unlabeled text, ignoring the information present in the internal structure of words or any information available in domain-specific structured resources such as ontologies. However, such information holds potential for greatly improving the quality of the word representation, as suggested in some recent studies in the general domain. Here we present BioWordVec: an open set of biomedical word vectors/embeddings that combines subword information from unlabeled biomedical text with a widely-used biomedical controlled vocabulary called Medical Subject Headings (MeSH). We assess both the validity and utility of our generated word embeddings over multiple NLP tasks in the biomedical domain. Our benchmarking results demonstrate that our word embeddings can result in significantly improved performance over the previous state of the art in those challenging tasks.
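As a rough illustration of what "subword information" can mean, the sketch below splits a word into character n-grams, in the style of fastText-like subword models, and shows that morphologically related biomedical terms share many of them. The function names, the `<` and `>` boundary markers, and the default n-gram range of 3 to 6 are illustrative assumptions; none of them are taken from the abstract.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Return the character n-grams of a word wrapped in boundary markers.

    The markers '<' and '>' let prefixes and suffixes be told apart
    from n-grams that occur in the middle of a word.
    """
    wrapped = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        # Slide a window of width n over the wrapped word.
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams


def shared_ngrams(a, b, n_min=3, n_max=6):
    """Return the set of n-grams that two words have in common."""
    return set(char_ngrams(a, n_min, n_max)) & set(char_ngrams(b, n_min, n_max))


if __name__ == "__main__":
    # Related terms overlap in their subword units even when one of
    # them is rare or unseen as a whole word.
    print(sorted(shared_ngrams("hepatitis", "hepatic")))
```

In a subword model of this kind, a word vector is typically built from the vectors of its n-grams, so a rare term can borrow information from more frequent terms that share its n-grams.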

Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/d1e5/6510737/55dd7daff63c/41597_2019_55_Fig1_HTML.jpg

Similar articles

BioWordVec, improving biomedical word embeddings with subword information and MeSH.

Sci Data. 2019 May 10;6(1):52. doi: 10.1038/s41597-019-0055-0.

Improved biomedical word embeddings in the transformer era.

J Biomed Inform. 2021 Aug;120:103867. doi: 10.1016/j.jbi.2021.103867. Epub 2021 Jul 18.

A comparison of word embeddings for the biomedical natural language processing.

J Biomed Inform. 2018 Nov;87:12-20. doi: 10.1016/j.jbi.2018.09.008. Epub 2018 Sep 12.

Evaluating semantic relations in neural word embeddings with biomedical and general domain knowledge bases.

BMC Med Inform Decis Mak. 2018 Jul 23;18(Suppl 2):65. doi: 10.1186/s12911-018-0630-x.

deepBioWSD: effective deep neural word sense disambiguation of biomedical text data.

J Am Med Inform Assoc. 2019 May 1;26(5):438-446. doi: 10.1093/jamia/ocy189.

Multi-Ontology Refined Embeddings (MORE): A hybrid multi-ontology and corpus-based semantic representation model for biomedical concepts.

J Biomed Inform. 2020 Nov;111:103581. doi: 10.1016/j.jbi.2020.103581. Epub 2020 Oct 1.

Use of word and graph embedding to measure semantic relatedness between Unified Medical Language System concepts.

J Am Med Inform Assoc. 2020 Oct 1;27(10):1538-1546. doi: 10.1093/jamia/ocaa136.

Jointly learning word embeddings using a corpus and a knowledge base.

PLoS One. 2018 Mar 12;13(3):e0193094. doi: 10.1371/journal.pone.0193094. eCollection 2018.

Deep contextualized embeddings for quantifying the informative content in biomedical text summarization.

Comput Methods Programs Biomed. 2020 Feb;184:105117. doi: 10.1016/j.cmpb.2019.105117. Epub 2019 Oct 4.

Knowledge based word-concept model estimation and refinement for biomedical text mining.

J Biomed Inform. 2015 Feb;53:300-7. doi: 10.1016/j.jbi.2014.11.015. Epub 2014 Dec 12.

Cited by

Knowledge based convolutional transformer for joint estimation of PM2.5 and O3 concentrations.

Sci Rep. 2025 Jul 14;15(1):25340. doi: 10.1038/s41598-025-95019-5.

Predicting Drug-Side Effect Relationships From Parametric Knowledge Embedded in Biomedical BERT Models: Methodological Study With a Natural Language Processing Approach.

JMIR Med Inform. 2025 Jul 10;13:e67513. doi: 10.2196/67513.

Benchmarking pre-trained text embedding models in aligning built asset information.

Sci Rep. 2025 Jul 4;15(1):23866. doi: 10.1038/s41598-025-09052-5.

CSpace: a concept embedding space for biomedical applications.

Bioinformatics. 2025 Jul 1;41(7). doi: 10.1093/bioinformatics/btaf376.

Machine learning to predict penumbra core mismatch in acute ischemic stroke using clinical note data.

NPJ Digit Med. 2025 Jun 6;8(1):340. doi: 10.1038/s41746-025-01703-1.

Augmented Ensemble Model (AEM) for health trends prediction on social networks.

PLoS One. 2025 Jun 5;20(6):e0323449. doi: 10.1371/journal.pone.0323449. eCollection 2025.

Use of deep learning-based NLP models for full-text data elements extraction for systematic literature review tasks.

Sci Rep. 2025 Jun 3;15(1):19379. doi: 10.1038/s41598-025-03979-5.

Predicting 30-Day Postoperative Mortality and American Society of Anesthesiologists Physical Status Using Retrieval-Augmented Large Language Models: Development and Validation Study.

J Med Internet Res. 2025 Jun 3;27:e75052. doi: 10.2196/75052.

FedWeight: mitigating covariate shift of federated learning on electronic health records data through patients re-weighting.

NPJ Digit Med. 2025 May 17;8(1):286. doi: 10.1038/s41746-025-01661-8.

Transformer-Based Language Models for Group Randomized Trial Classification in Biomedical Literature: Model Development and Validation.

JMIR Med Inform. 2025 May 9;13:e63267. doi: 10.2196/63267.
