Replacing non-biomedical concepts improves embedding of biomedical concepts.

Author information

Niyonkuru Enock, Gomez Mauricio Soto, Casarighi Elena, Antogiovanni Stephan, Blau Hannah, Reese Justin T, Valentini Giorgio, Robinson Peter N

Affiliations

The Jackson Laboratory for Genomic Medicine, Farmington, Connecticut, United States of America.

Trinity College, Hartford, Connecticut, United States of America.

Publication information

PLoS One. 2025 May 5;20(5):e0322498. doi: 10.1371/journal.pone.0322498. eCollection 2025.

Abstract

Embeddings are semantically meaningful representations of words in a vector space, commonly used to enhance downstream machine learning applications. Traditional biomedical embedding techniques often replace all synonymous words representing biological or medical concepts with a unique token, ensuring consistent representation and improving embedding quality. However, the potential impact of replacing non-biomedical concept synonyms has received less attention. Embedding approaches often employ concept replacement to replace concepts that span multiple words, such as non-small-cell lung carcinoma, with a single concept identifier (e.g., D002289). Also, all synonyms of each concept are merged into the same identifier. Here, we additionally leveraged WordNet to identify and replace sets of non-biomedical synonyms with their most common representatives. This combined approach aimed to reduce embedding noise from non-biomedical terms while preserving the integrity of biomedical concept representations. We applied this method to 1,055 biomedical concept sets representing molecular signatures or medical categories and assessed the mean pairwise distance of embeddings with and without non-biomedical synonym replacement. A smaller mean pairwise distance was interpreted as greater intra-cluster coherence and higher embedding quality. Embeddings were generated using the Word2Vec algorithm applied to a corpus of 10 million PubMed abstracts. Our results demonstrate that the addition of non-biomedical synonym replacement reduced the mean intra-cluster distance by an average of 8%, suggesting that this complementary approach enhances embedding quality. Future work will assess its applicability to other embedding techniques and downstream tasks. Python code implementing this method is provided under an open-source license.
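The two steps the abstract describes, synonym replacement and the mean pairwise distance evaluation, can be illustrated with a minimal sketch. It is not the authors' released code. The synonym table is a hypothetical stand-in for WordNet lookups, the function names are invented for illustration, and cosine distance is an assumption, since the abstract does not name the distance metric.

```python
import math
from itertools import combinations

# Hypothetical synonym table standing in for WordNet synonym sets.
# Each key is a non-biomedical word; each value is the representative
# assumed to be the most common member of its synonym set.
SYNONYM_TO_REPRESENTATIVE = {
    "demonstrate": "show",
    "exhibit": "show",
    "big": "large",
}


def replace_synonyms(tokens, mapping, protected=frozenset()):
    """Replace each token with its representative, leaving protected
    tokens (e.g. biomedical concept identifiers) untouched."""
    return [t if t in protected else mapping.get(t, t) for t in tokens]


def cosine_distance(u, v):
    """Cosine distance between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def mean_pairwise_distance(vectors):
    """Average distance over all unordered pairs of vectors in a concept set.
    A smaller value is read as a more coherent cluster."""
    pairs = list(combinations(vectors, 2))
    if not pairs:
        raise ValueError("need at least two vectors")
    return sum(cosine_distance(u, v) for u, v in pairs) / len(pairs)


# Toy usage: one tokenized sentence and one three-member concept set.
tokens = ["results", "demonstrate", "D002289", "big"]
replaced = replace_synonyms(tokens, SYNONYM_TO_REPRESENTATIVE,
                            protected=frozenset({"D002289"}))
toy_concept_set = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
score = mean_pairwise_distance(toy_concept_set)
```

In the study's setting, the vectors would come from Word2Vec models trained on the corpus with and without the replacement step, and the two scores for each concept set would then be compared.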


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/28d1/12052101/282714c0c428/pone.0322498.g001.jpg
