Niyonkuru Enock, Gomez Mauricio Soto, Casiraghi Elena, Antogiovanni Stephan, Blau Hannah, Reese Justin T, Valentini Giorgio, Robinson Peter N
The Jackson Laboratory for Genomic Medicine, Farmington, CT 06032, USA.
Trinity College, Hartford, CT, USA.
bioRxiv. 2024 Jul 4:2024.07.01.601556. doi: 10.1101/2024.07.01.601556.
Concept embeddings are low-dimensional vector representations of concepts such as MeSH:D009203 (Myocardial Infarction), whose similarity in the embedded vector space reflects their semantic similarity. Here, we test the hypothesis that replacing synonyms of non-biomedical concepts can improve the quality of biomedical concept embeddings.
We developed an approach that leverages WordNet to replace sets of synonyms with the most common representative of the synonym set.
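The replacement step can be illustrated with a minimal sketch. It assumes the synonym sets have already been obtained (in the approach described above they come from WordNet; here they are a hand-written toy list) and that "most common" means the most frequent member of the set in the corpus being processed. The function names `build_replacement_map` and `replace_synonyms` are hypothetical and are not taken from the wn2vec package.

```python
from collections import Counter


def build_replacement_map(tokens, synonym_sets):
    """Map each word to the most frequent member of its synonym set.

    Assumption: frequency is counted in `tokens` itself. A word that
    appears in several sets is assigned by the last set that contains it,
    which is a simplification made for this sketch.
    """
    counts = Counter(tokens)
    mapping = {}
    for syn_set in synonym_sets:
        # Sort first so that ties are broken deterministically (alphabetically).
        representative = max(sorted(syn_set), key=lambda word: counts[word])
        for word in syn_set:
            if word != representative:
                mapping[word] = representative
    return mapping


def replace_synonyms(tokens, mapping):
    """Replace every token that has a representative; keep the rest as-is."""
    return [mapping.get(token, token) for token in tokens]


# Toy corpus and toy synonym set (illustrative only, not WordNet output).
tokens = "the big tumor was large and the large lesion was huge".split()
synonym_sets = [{"big", "large", "huge"}]

mapping = build_replacement_map(tokens, synonym_sets)
replaced = replace_synonyms(tokens, mapping)
print(" ".join(replaced))
```

In this toy corpus "large" is the most frequent member of the set, so "big" and "huge" are both rewritten to "large" before the text would be passed to an embedding algorithm. Biomedical terms are left untouched here simply because they appear in no synonym set.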
We tested our approach on 1055 concept sets and found that, on average, the mean intra-cluster distance in the vector space was reduced by 8%. Assuming that homophily of related concepts in the vector space is desirable, our approach tends to improve the quality of embeddings.
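As a rough illustration of the evaluation metric, the sketch below computes the mean pairwise distance among the embedding vectors of one concept set, together with the relative reduction between two such values. The abstract does not state which distance measure was used; cosine distance is assumed here, and the vectors are toy values rather than results from the study. The function names are hypothetical.

```python
import math
from itertools import combinations


def cosine_distance(u, v):
    """Return 1 minus the cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def mean_intra_cluster_distance(vectors):
    """Average the distance over all unordered pairs of vectors in one set."""
    pairs = list(combinations(vectors, 2))
    if not pairs:
        raise ValueError("a concept set needs at least two vectors")
    return sum(cosine_distance(u, v) for u, v in pairs) / len(pairs)


def relative_reduction(before, after):
    """Fractional decrease from `before` to `after` (0.25 means 25% lower)."""
    return (before - after) / before


# Toy 2-dimensional embeddings for a single concept set (illustrative only).
concept_set_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
distance = mean_intra_cluster_distance(concept_set_vectors)
print(f"mean intra-cluster distance: {distance:.4f}")
```

Under this assumption, a lower value for a concept set after synonym replacement indicates that related concepts lie closer together in the embedded space.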
This pilot study shows that non-biomedical synonym replacement tends to improve the quality of biomedical concept embeddings produced with the Word2Vec algorithm. Our approach is implemented in a freely available Python package at https://github.com/TheJacksonLaboratory/wn2vec.