PMCVec: Distributed phrase representation for biomedical text processing.

Authors

Gero Zelalem, Ho Joyce

Affiliations

Emory University, Department of Computer Science, Atlanta, USA.

Publication information

J Biomed Inform. 2019;100S:100047. doi: 10.1016/j.yjbinx.2019.100047. Epub 2019 Jul 20.

DOI:10.1016/j.yjbinx.2019.100047
PMID:34384576
Abstract

Distributed semantic representation of biomedical text can be beneficial for text classification, named entity recognition, query expansion, human comprehension, and information retrieval. Despite the success of high-quality vector space models such as Word2Vec and GloVe, they only provide unigram word representations and the semantics for multi-word phrases can only be approximated by composition. This is problematic in biomedical text processing where technical phrases for diseases, symptoms, and drugs should be represented as single entities to capture the correct meaning. In this paper, we introduce PMCVec, an unsupervised technique that generates important phrases from PubMed abstracts and learns embeddings for single words and multi-word phrases simultaneously. Evaluations performed on benchmark datasets produce significant performance gains both qualitatively and quantitatively.
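
The contrast drawn in the abstract, between approximating a phrase by composing unigram vectors and representing it as a single entity, can be illustrated with a minimal sketch. This is not PMCVec's phrase-generation or training procedure, which the abstract does not detail. The phrase list, the toy two-dimensional vectors, and the function names `compose_by_average` and `merge_phrases` are all hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Sequence, Set, Tuple

Vector = List[float]


def compose_by_average(phrase: Sequence[str], vectors: Dict[str, Vector]) -> Vector:
    """Approximate a phrase vector by averaging the vectors of its words."""
    dims = len(next(iter(vectors.values())))
    total = [0.0] * dims
    for word in phrase:
        for i, value in enumerate(vectors[word]):
            total[i] += value
    return [value / len(phrase) for value in total]


def merge_phrases(
    tokens: Sequence[str],
    phrases: Set[Tuple[str, ...]],
    max_len: int = 4,
) -> List[str]:
    """Rewrite each known phrase as one underscore-joined token (greedy longest match)."""
    merged: List[str] = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate first, down to two-word phrases.
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            candidate = tuple(tokens[i:i + n])
            if candidate in phrases:
                merged.append("_".join(candidate))
                i += n
                break
        else:
            # No phrase starts here, so keep the single word.
            merged.append(tokens[i])
            i += 1
    return merged


# Toy data (made up for this sketch, not taken from the paper).
toy_vectors = {"heart": [1.0, 0.0], "attack": [0.0, 1.0]}
toy_phrases = {("heart", "attack"), ("beta", "blockers")}
sentence = "patients with heart attack received beta blockers".split()

# Composition: the phrase vector is only an average of its parts.
print(compose_by_average(("heart", "attack"), toy_vectors))

# Single-entity view: phrases become tokens before any embedding is trained.
print(merge_phrases(sentence, toy_phrases))
```

Under this assumption, a corpus rewritten by `merge_phrases` could be passed to any word-embedding trainer, so that `heart_attack` receives its own vector alongside `heart` and `attack` instead of inheriting an average of the two.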


Similar articles

1
PMCVec: Distributed phrase representation for biomedical text processing.
J Biomed Inform. 2019;100S:100047. doi: 10.1016/j.yjbinx.2019.100047. Epub 2019 Jul 20.
2
Unsupervised low-dimensional vector representations for words, phrases and text that are transparent, scalable, and produce similarity metrics that are not redundant with neural embeddings.
J Biomed Inform. 2019 Feb;90:103096. doi: 10.1016/j.jbi.2019.103096. Epub 2019 Jan 14.
3
A comparison of word embeddings for the biomedical natural language processing.
J Biomed Inform. 2018 Nov;87:12-20. doi: 10.1016/j.jbi.2018.09.008. Epub 2018 Sep 12.
4
Biomedical Text Classification Using Augmented Word Representation Based on Distributional and Relational Contexts.
Comput Intell Neurosci. 2023 Feb 15;2023:2989791. doi: 10.1155/2023/2989791. eCollection 2023.
5
Text mining-based word representations for biomedical data analysis and protein-protein interaction networks in machine learning tasks.
PLoS One. 2021 Oct 15;16(10):e0258623. doi: 10.1371/journal.pone.0258623. eCollection 2021.
6
BioConceptVec: Creating and evaluating literature-based biomedical concept embeddings on a large scale.
PLoS Comput Biol. 2020 Apr 23;16(4):e1007617. doi: 10.1371/journal.pcbi.1007617. eCollection 2020 Apr.
7
Comparing general and specialized word embeddings for biomedical named entity recognition.
PeerJ Comput Sci. 2021 Feb 18;7:e384. doi: 10.7717/peerj-cs.384. eCollection 2021.
8
Utility of General and Specific Word Embeddings for Classifying Translational Stages of Research.
AMIA Annu Symp Proc. 2018 Dec 5;2018:1405-1414. eCollection 2018.
9
Jointly learning word embeddings using a corpus and a knowledge base.
PLoS One. 2018 Mar 12;13(3):e0193094. doi: 10.1371/journal.pone.0193094. eCollection 2018.
10
An Unsupervised Graph Based Continuous Word Representation Method for Biomedical Text Mining.
IEEE/ACM Trans Comput Biol Bioinform. 2016 Jul-Aug;13(4):634-42. doi: 10.1109/TCBB.2015.2478467. Epub 2015 Sep 14.

Cited by

1
A scoping review of preprocessing methods for unstructured text data to assess data quality.
Int J Popul Data Sci. 2022 Oct 4;7(1):1757. doi: 10.23889/ijpds.v6i1.1757. eCollection 2022.
2
MMiDaS-AE: Multi-modal Missing Data aware Stacked Autoencoder for Biomedical Abstract Screening.
Proc ACM Conf Health Inference Learn (2020). 2020 Apr;2020:139-150. doi: 10.1145/3368555.3384463. Epub 2020 Apr 2.