Corpus domain effects on distributional semantic modeling of medical terms.

Authors

Pakhomov Serguei V S, Finley Greg, McEwan Reed, Wang Yan, Melton Genevieve B

Affiliations

College of Pharmacy, University of Minnesota, Minneapolis, MN 55455, USA.

Institute for Health Informatics, University of Minnesota, Minneapolis, MN 55455, USA.

Publication information

Bioinformatics. 2016 Dec 1;32(23):3635-3644. doi: 10.1093/bioinformatics/btw529. Epub 2016 Aug 16.

DOI:10.1093/bioinformatics/btw529

PMID:27531100

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC5181540/

Abstract

MOTIVATION

Automatically quantifying semantic similarity and relatedness between clinical terms is an important aspect of text mining from electronic health records, which are increasingly recognized as valuable sources of phenotypic information for clinical genomics and bioinformatics research. A key obstacle to the development of semantic relatedness measures is the limited availability of large quantities of clinical text to researchers and developers outside of major medical centers. Text from general English sources and the biomedical literature is freely available; however, its validity as a substitute for clinical-domain text in representing the semantics of clinical terms remains to be demonstrated.

RESULTS

We constructed neural network representations of clinical terms found in a publicly available benchmark dataset manually labeled for semantic similarity and relatedness. Similarity and relatedness measures computed from text corpora in three domains (Clinical Notes, PubMed Central articles and Wikipedia) were compared using the benchmark as reference. We found that measures computed from the full text of biomedical articles in the PubMed Central repository (rho = 0.62 for similarity and 0.58 for relatedness) are on par with measures computed from clinical reports (rho = 0.60 for similarity and 0.57 for relatedness). We also evaluated neural network-based relatedness measures in two applications: query expansion for a clinical document retrieval task and a biomedical term word sense disambiguation task. We found that, with some limitations, biomedical articles may be used in lieu of clinical reports to represent the semantics of clinical terms and that distributional semantic methods are useful for clinical and biomedical natural language processing applications.
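As a rough illustration of the kind of comparison summarized above, the sketch below scores term pairs by the cosine similarity of their vectors and then measures agreement with human ratings using Spearman's rho. It is a minimal sketch and not the authors' implementation: scoring by cosine similarity is an assumption made here, and the three-dimensional vectors, the term pairs and the ratings are all hypothetical values invented for the example. Vectors learned from a real corpus would have far more dimensions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def ranks(values):
    """1-based ranks in ascending order; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result

def spearman(x, y):
    """Spearman's rho: Pearson correlation of the rank-transformed values."""
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)

# Hypothetical term vectors (assumption: invented for illustration only).
vectors = {
    "heart": [0.9, 0.1, 0.0],
    "cardiac": [0.8, 0.2, 0.1],
    "kidney": [0.1, 0.9, 0.1],
    "renal": [0.2, 0.8, 0.0],
    "fracture": [0.0, 0.1, 0.9],
}

# Hypothetical human ratings for term pairs (higher means more similar).
rated_pairs = [
    ("heart", "cardiac", 4.0),
    ("kidney", "renal", 3.8),
    ("heart", "kidney", 2.0),
    ("cardiac", "fracture", 1.0),
]

model_scores = [cosine(vectors[a], vectors[b]) for a, b, _ in rated_pairs]
human_scores = [rating for _, _, rating in rated_pairs]
rho = spearman(model_scores, human_scores)
print(f"Spearman rho = {rho:.2f}")
```

Running the same scoring with vectors trained on different corpora and comparing the resulting rho values is one way to read the per-corpus figures reported in the abstract. A rank correlation is used because human ratings and model scores sit on different scales, so only their ordering is compared.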

AVAILABILITY AND IMPLEMENTATION

The software and reference standards used in this study to evaluate semantic similarity and relatedness measures are publicly available as detailed in the article.

CONTACT

pakh0002@umn.edu

SUPPLEMENTARY INFORMATION

Supplementary data are available at Bioinformatics online.
