Similar articles

1
Unsupervised low-dimensional vector representations for words, phrases and text that are transparent, scalable, and produce similarity metrics that are not redundant with neural embeddings.
J Biomed Inform. 2019 Feb;90:103096. doi: 10.1016/j.jbi.2019.103096. Epub 2019 Jan 14.
2
A comparison of word embeddings for the biomedical natural language processing.
J Biomed Inform. 2018 Nov;87:12-20. doi: 10.1016/j.jbi.2018.09.008. Epub 2018 Sep 12.
3
PMCVec: Distributed phrase representation for biomedical text processing.
J Biomed Inform. 2019;100S:100047. doi: 10.1016/j.yjbinx.2019.100047. Epub 2019 Jul 20.
4
Text mining-based word representations for biomedical data analysis and protein-protein interaction networks in machine learning tasks.
PLoS One. 2021 Oct 15;16(10):e0258623. doi: 10.1371/journal.pone.0258623. eCollection 2021.
5
Evaluating semantic relations in neural word embeddings with biomedical and general domain knowledge bases.
BMC Med Inform Decis Mak. 2018 Jul 23;18(Suppl 2):65. doi: 10.1186/s12911-018-0630-x.
6
Corpus domain effects on distributional semantic modeling of medical terms.
Bioinformatics. 2016 Dec 1;32(23):3635-3644. doi: 10.1093/bioinformatics/btw529. Epub 2016 Aug 16.
7
Utility of General and Specific Word Embeddings for Classifying Translational Stages of Research.
AMIA Annu Symp Proc. 2018 Dec 5;2018:1405-1414. eCollection 2018.
8
Jointly learning word embeddings using a corpus and a knowledge base.
PLoS One. 2018 Mar 12;13(3):e0193094. doi: 10.1371/journal.pone.0193094. eCollection 2018.
9
Vector representations of multi-word terms for semantic relatedness.
J Biomed Inform. 2018 Jan;77:111-119. doi: 10.1016/j.jbi.2017.12.006. Epub 2017 Dec 13.
10
Biomedical Text Classification Using Augmented Word Representation Based on Distributional and Relational Contexts.
Comput Intell Neurosci. 2023 Feb 15;2023:2989791. doi: 10.1155/2023/2989791. eCollection 2023.

Cited by

1
Refining electronic medical records representation in manifold subspace.
BMC Bioinformatics. 2022 Apr 1;23(1):115. doi: 10.1186/s12859-022-04653-7.
2
A web-based tool for automatically linking clinical trials to their publications.
J Am Med Inform Assoc. 2022 Apr 13;29(5):822-830. doi: 10.1093/jamia/ocab290.
3
Anne O'Tate: Value-added PubMed search engine for analysis and text mining.
PLoS One. 2021 Mar 8;16(3):e0248335. doi: 10.1371/journal.pone.0248335. eCollection 2021.
4
BioWordVec, improving biomedical word embeddings with subword information and MeSH.
Sci Data. 2019 May 10;6(1):52. doi: 10.1038/s41597-019-0055-0.
5
Design of a generic, open platform for machine learning-assisted indexing and clustering of articles in PubMed, a biomedical bibliographic database.
Data Inf Manag. 2018 Jun;2(1):27-36. doi: 10.2478/dim-2018-0004. Epub 2018 May 22.


Unsupervised low-dimensional vector representations for words, phrases and text that are transparent, scalable, and produce similarity metrics that are not redundant with neural embeddings.

Authors

Smalheiser Neil R, Cohen Aaron M, Bonifield Gary

Affiliations

Department of Psychiatry and Psychiatric Institute, University of Illinois College of Medicine, Chicago, IL 60612, USA.

Department of Medical Informatics and Clinical Epidemiology, Oregon Health & Science University, Portland, OR 97239, USA.

Publication information

J Biomed Inform. 2019 Feb;90:103096. doi: 10.1016/j.jbi.2019.103096. Epub 2019 Jan 14.

DOI:10.1016/j.jbi.2019.103096
PMID:30654030
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC6557457/
Abstract

Neural embeddings are a popular set of methods for representing words, phrases or text as a low dimensional vector (typically 50-500 dimensions). However, it is difficult to interpret these dimensions in a meaningful manner, and creating neural embeddings requires extensive training and tuning of multiple parameters and hyperparameters. We present here a simple unsupervised method for representing words, phrases or text as a low dimensional vector, in which the meaning and relative importance of dimensions is transparent to inspection. We have created a near-comprehensive vector representation of words, and selected bigrams, trigrams and abbreviations, using the set of titles and abstracts in PubMed as a corpus. This vector is used to create several novel implicit word-word and text-text similarity metrics. The implicit word-word similarity metrics correlate well with human judgement of word pair similarity and relatedness, and outperform or equal all other reported methods on a variety of biomedical benchmarks, including several implementations of neural embeddings trained on PubMed corpora. Our implicit word-word metrics capture different aspects of word-word relatedness than word2vec-based metrics and are only partially correlated (rho = 0.5-0.8 depending on task and corpus). The vector representations of words, bigrams, trigrams, abbreviations, and PubMed title + abstracts are all publicly available from http://arrowsmith.psych.uic.edu/arrowsmith_uic/word_similarity_metrics.html for release under CC-BY-NC license. Several public web query interfaces are also available at the same site, including one which allows the user to specify a given word and view its most closely related terms according to direct co-occurrence as well as different implicit similarity metrics.
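
The abstract reports that the implicit word-word metrics are only partially correlated with word2vec-based metrics (rho = 0.5-0.8). The sketch below is one way such a comparison could be set up. It is not the authors' code. It assumes cosine similarity as the vector-based score and Spearman rank correlation as rho, neither of which the abstract states, and every vector, word pair, and function name in it is made up for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in rx))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in ry))
    return cov / (sd_x * sd_y)


# Hypothetical toy vectors standing in for two different representations
# of the same words (NOT taken from the published data set).
metric_a = {"aspirin": [0.9, 0.1, 0.0], "ibuprofen": [0.8, 0.2, 0.1],
            "kidney": [0.1, 0.9, 0.2], "renal": [0.2, 0.8, 0.1]}
metric_b = {"aspirin": [0.7, 0.3], "ibuprofen": [0.6, 0.5],
            "kidney": [0.2, 0.9], "renal": [0.4, 0.8]}

pairs = [("aspirin", "ibuprofen"), ("kidney", "renal"),
         ("aspirin", "kidney"), ("ibuprofen", "renal")]

scores_a = [cosine(metric_a[w1], metric_a[w2]) for w1, w2 in pairs]
scores_b = [cosine(metric_b[w1], metric_b[w2]) for w1, w2 in pairs]

# rho near 1 would mean the two metrics rank the pairs almost identically;
# a moderate rho suggests they capture partly different aspects of relatedness.
rho = spearman(scores_a, scores_b)
print(f"Spearman rho between the two metrics: {rho:.2f}")
```

The same `spearman` helper could be applied to one metric's scores and a list of human ratings for the same word pairs, which is the usual shape of the benchmark evaluations the abstract mentions.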
