Dynomant Emeric, Lelong Romain, Dahamna Badisse, Massonnaud Clément, Kerdelhué Gaétan, Grosjean Julien, Canu Stéphane, Darmoni Stefan J
OmicX, Le Petit Quevilly, France.
Rouen University Hospital, Department of Biomedical Informatics, D2IM, Rouen, France.
JMIR Med Inform. 2019 Jul 29;7(3):e12310. doi: 10.2196/12310.
Word embedding technologies, a set of language modeling and feature learning techniques in natural language processing (NLP), are now used in a wide range of applications. However, no formal evaluation and comparison have been made of the ability of the 3 currently most popular unsupervised implementations (Word2Vec, GloVe, and FastText) to capture the semantic similarities between words when trained on the same dataset.
The aim of this study was to compare embedding methods trained on a corpus of French health-related documents produced in a professional context. The best-performing method will then help us develop a new semantic annotator.
Unsupervised embedding models were trained on 641,279 documents originating from the Rouen University Hospital. These data are unstructured and cover a wide range of document types produced in a clinical setting (discharge summaries, procedure reports, and prescriptions). In total, 4 rated evaluation tasks were defined (cosine similarity, odd one out, analogy-based operations, and formal human evaluation) and applied to each model, along with embedding visualization.
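The first three rated tasks reduce to simple vector operations over the learned embeddings. A minimal sketch of how such tasks can be scored, using hypothetical toy vectors rather than the study's actual models (the words and values below are illustrative assumptions, not data from the paper):

```python
import numpy as np

# Hypothetical 3-dimensional embeddings for illustration only; real models
# (Word2Vec, GloVe, FastText) produce vectors of hundreds of dimensions.
emb = {
    "fievre":      np.array([0.9, 0.10, 0.20]),
    "temperature": np.array([0.8, 0.20, 0.30]),
    "aspirine":    np.array([0.1, 0.90, 0.10]),
    "paracetamol": np.array([0.2, 0.85, 0.15]),
}

def cosine(u, v):
    """Cosine similarity: dot product of the L2-normalized vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def odd_one_out(words):
    """Return the word least similar, on average, to the others."""
    scores = {}
    for w in words:
        others = [cosine(emb[w], emb[o]) for o in words if o != w]
        scores[w] = sum(others) / len(others)
    return min(scores, key=scores.get)

def analogy(a, b, c):
    """Solve a : b :: c : ? by vector arithmetic (b - a + c)."""
    target = emb[b] - emb[a] + emb[c]
    candidates = {w: cosine(target, v)
                  for w, v in emb.items() if w not in (a, b, c)}
    return max(candidates, key=candidates.get)
```

Libraries such as gensim expose equivalent operations directly (e.g. `most_similar` and `doesnt_match` on a trained model); the sketch above only makes the underlying arithmetic explicit.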
Word2Vec achieved the highest score on 3 of the 4 rated tasks (analogy-based operations, odd one out, and human validation), particularly with the skip-gram architecture.
Although this implementation best preserved semantic properties, each model has its own strengths and weaknesses, such as GloVe's very short training time or FastText's preservation of morphological similarity. The models and test sets produced by this study will be the first of their kind to be made publicly available through a graphical interface, to help advance French biomedical research.