Semantic relatedness and similarity of biomedical terms: examining the effects of recency, size, and section of biomedical publications on the performance of word2vec.

Authors

Zhu Yongjun, Yan Erjia, Wang Fei

Affiliations

Healthcare Policy and Research, Weill Cornell Medicine, Cornell University, New York, NY, USA.

College of Computing and Informatics, Drexel University, Philadelphia, PA, USA.

Publication information

BMC Med Inform Decis Mak. 2017 Jul 3;17(1):95. doi: 10.1186/s12911-017-0498-1.

DOI:10.1186/s12911-017-0498-1

PMID:28673289

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC5496182/

Abstract

BACKGROUND

Understanding semantic relatedness and similarity between biomedical terms has a great impact on a variety of applications such as biomedical information retrieval, information extraction, and recommender systems. The objective of this study is to examine word2vec's ability to derive semantic relatedness and similarity between biomedical terms from large publication data. Specifically, we focus on the effects of recency, size, and section of biomedical publication data on the performance of word2vec.

METHODS

We download abstracts of 18,777,129 articles from PubMed and 766,326 full-text articles from PubMed Central (PMC). The datasets are preprocessed and grouped into subsets by recency, size, and section. Word2vec models are trained on these subsets. Cosine similarities between biomedical terms obtained from the word2vec models are compared against reference standards. The performance of models trained on different subsets is compared to examine recency, size, and section effects.
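The evaluation step described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the toy vectors, and the toy reference scores are all hypothetical, and the use of Pearson correlation is an assumption, since the abstract says only that model output is correlated with reference standards. The sketch scores each reference pair whose two terms both have vectors, counts those covered pairs, and correlates the cosine similarities with the reference scores.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pearson(xs, ys):
    """Pearson correlation (assumed here; the abstract does not name the coefficient)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def evaluate(vectors, reference):
    """Return (number of covered pairs, correlation with the reference scores).

    A pair is covered only if the model has a vector for both terms.
    """
    model_scores, reference_scores = [], []
    for (term_a, term_b), score in reference.items():
        if term_a in vectors and term_b in vectors:
            model_scores.append(cosine_similarity(vectors[term_a], vectors[term_b]))
            reference_scores.append(score)
    return len(model_scores), pearson(model_scores, reference_scores)


# Toy 2-d vectors standing in for trained word2vec embeddings (made up).
vectors = {
    "aspirin": [1.0, 0.0],
    "ibuprofen": [0.8, 0.6],
    "fracture": [0.0, 1.0],
}

# Toy reference ratings (made up, not taken from any real reference standard).
reference = {
    ("aspirin", "ibuprofen"): 3.5,
    ("aspirin", "fracture"): 1.0,
    ("ibuprofen", "fracture"): 2.0,
    ("aspirin", "warfarin"): 3.0,  # "warfarin" has no vector, so this pair is skipped
}

covered, correlation = evaluate(vectors, reference)
print(f"covered pairs: {covered}, correlation: {correlation:.2f}")
```

Reporting the covered-pair count alongside the correlation mirrors the two quantities the abstract reports: how many term pairs a model can score at all, and how well its scores agree with the reference standard on those pairs.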

RESULTS

Models trained on recent datasets did not boost the performance. Models trained on larger datasets identified more pairs of biomedical terms than models trained on smaller datasets in the relatedness task (from 368 at the 10% level to 494 at the 100% level) and the similarity task (from 374 at the 10% level to 491 at the 100% level). The model trained on abstracts produced results that have higher correlations with the reference standards than the one trained on article bodies (i.e., 0.65 vs. 0.62 in the similarity task and 0.66 vs. 0.59 in the relatedness task). However, the latter identified more pairs of biomedical terms than the former (i.e., 344 vs. 498 in the similarity task and 339 vs. 503 in the relatedness task).

CONCLUSIONS

Increasing the size of the dataset does not always enhance the performance. Increasing the size of datasets can result in the identification of more relations between biomedical terms, even though it does not guarantee better precision. As summaries of research articles, abstracts excel in accuracy compared with article bodies but lose in coverage of identifiable relations.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/a666/5496182/1116fe08d268/12911_2017_498_Fig1_HTML.jpg

Similar articles

BMC Med Inform Decis Mak. 2017 Jul 3;17(1):95. doi: 10.1186/s12911-017-0498-1.

A comparison of word embeddings for the biomedical natural language processing.

J Biomed Inform. 2018 Nov;87:12-20. doi: 10.1016/j.jbi.2018.09.008. Epub 2018 Sep 12.

Corpus domain effects on distributional semantic modeling of medical terms.

Bioinformatics. 2016 Dec 1;32(23):3635-3644. doi: 10.1093/bioinformatics/btw529. Epub 2016 Aug 16.

Neural sentence embedding models for semantic similarity estimation in the biomedical domain.

BMC Bioinformatics. 2019 Apr 11;20(1):178. doi: 10.1186/s12859-019-2789-2.

In the pursuit of a semantic similarity metric based on UMLS annotations for articles in PubMed Central Open Access.

J Biomed Inform. 2015 Oct;57:204-18. doi: 10.1016/j.jbi.2015.07.015. Epub 2015 Aug 1.

AMIA Annu Symp Proc. 2010 Nov 13;2010:572-6.

Discovering biomedical semantic relations in PubMed queries for information retrieval and database curation.

Database (Oxford). 2016 Mar 25;2016. doi: 10.1093/database/baw025. Print 2016.

An overview of the BIOASQ large-scale biomedical semantic indexing and question answering competition.

BMC Bioinformatics. 2015 Apr 30;16:138. doi: 10.1186/s12859-015-0564-6.

tESA: a distributional measure for calculating semantic relatedness.

J Biomed Semantics. 2016 Dec 28;7(1):67. doi: 10.1186/s13326-016-0109-6.

Vector representations of multi-word terms for semantic relatedness.

J Biomed Inform. 2018 Jan;77:111-119. doi: 10.1016/j.jbi.2017.12.006. Epub 2017 Dec 13.

Cited by

Validating the representation of distance between infarct diseases using word embedding.

BMC Med Inform Decis Mak. 2022 Dec 7;22(1):322. doi: 10.1186/s12911-022-02061-8.

The positive energy of netizens: development and application of fine-grained sentiment lexicon and emotional intensity model.

Curr Psychol. 2022 Nov 3:1-18. doi: 10.1007/s12144-022-03876-4.

Machine-learning as a validated tool to characterize individual differences in free recall of naturalistic events.

Psychon Bull Rev. 2023 Feb;30(1):308-316. doi: 10.3758/s13423-022-02171-4. Epub 2022 Sep 9.

An Ensemble Semantic Textual Similarity Measure Based on Multiple Evidences for Biomedical Documents.

Comput Math Methods Med. 2022 Aug 27;2022:8238432. doi: 10.1155/2022/8238432. eCollection 2022.

Model-Based Reasoning of Clinical Diagnosis in Integrative Medicine: Real-World Methodological Study of Electronic Medical Records and Natural Language Processing Methods.

JMIR Med Inform. 2020 Dec 21;8(12):e23082. doi: 10.2196/23082.

A Bayesian Failure Prediction Network Based on Text Sequence Mining and Clustering.

Entropy (Basel). 2018 Dec 3;20(12):923. doi: 10.3390/e20120923.

Better synonyms for enriching biomedical search.

J Am Med Inform Assoc. 2020 Dec 9;27(12):1894-1902. doi: 10.1093/jamia/ocaa151.

Feature-Based Learning in Drug Prescription System for Medical Clinics.

Neural Process Lett. 2020;52(3):1703-1721. doi: 10.1007/s11063-020-10296-7. Epub 2020 Jul 2.

Characterization of near death experiences using text mining analyses: A preliminary study.

PLoS One. 2020 Jan 30;15(1):e0227402. doi: 10.1371/journal.pone.0227402. eCollection 2020.

Automated grouping of medical codes via multiview banded spectral clustering.

J Biomed Inform. 2019 Dec;100:103322. doi: 10.1016/j.jbi.2019.103322. Epub 2019 Oct 28.
