Word Embedding for the French Natural Language in Health Care: Comparative Study.

Author information

Dynomant Emeric, Lelong Romain, Dahamna Badisse, Massonnaud Clément, Kerdelhué Gaétan, Grosjean Julien, Canu Stéphane, Darmoni Stefan J

Affiliations

OmicX, Le Petit Quevilly, France.

Rouen University Hospital, Department of Biomedical Informatics, D2IM, Rouen, France.

Publication information

JMIR Med Inform. 2019 Jul 29;7(3):e12310. doi: 10.2196/12310.

DOI:10.2196/12310

PMID:31359873

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC6690161/

Abstract

BACKGROUND

Word embedding technologies, a set of language modeling and feature learning techniques in natural language processing (NLP), are now used in a wide range of applications. However, the ability of the 3 best-known unsupervised implementations (Word2Vec, GloVe, and FastText) to preserve the semantic similarities between words, when trained on the same dataset, has not been formally evaluated and compared.

OBJECTIVE

The aim of this study was to compare embedding methods trained on a corpus of French health-related documents produced in a professional context. The best method will then help us develop a new semantic annotator.

METHODS

Unsupervised embedding models were trained on 641,279 documents from the Rouen University Hospital. These data are unstructured and cover a wide range of documents produced in a clinical setting (discharge summaries, procedure reports, and prescriptions). In total, 4 rated evaluation tasks were defined (cosine similarity, odd one out, analogy-based operations, and formal human evaluation) and applied to each model, along with embedding visualization.
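The three automatic tasks named above can be sketched in a few lines. This is a minimal illustration, not the study's evaluation code: the 5-dimensional vectors, the French example words, and the function names are all invented here, and the actual test sets and scoring rules are not given in this abstract.

```python
import math

# Toy vectors invented for illustration only (NOT from the study's models).
# Dimensions, loosely: [cardiac, pulmonary, anatomy, clinician, drug]
TOY_VECTORS = {
    "coeur":       [1.0, 0.0, 1.0, 0.0, 0.0],  # heart
    "poumon":      [0.0, 1.0, 1.0, 0.0, 0.0],  # lung
    "cardiologue": [1.0, 0.0, 0.0, 1.0, 0.0],  # cardiologist
    "pneumologue": [0.0, 1.0, 0.0, 1.0, 0.0],  # pulmonologist
    "aspirine":    [0.1, 0.0, 0.0, 0.0, 1.0],  # aspirin
}


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def odd_one_out(words, vectors):
    """Return the word with the lowest mean similarity to the other words."""
    def mean_similarity(word):
        others = [w for w in words if w != word]
        total = sum(cosine_similarity(vectors[word], vectors[o]) for o in others)
        return total / len(others)
    return min(words, key=mean_similarity)


def analogy(a, b, c, vectors):
    """Solve a : b :: c : ? as the nearest neighbor of vec(b) - vec(a) + vec(c)."""
    target = [vb - va + vc
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    # Exclude the three query words, as is usual in analogy benchmarks.
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine_similarity(target, vectors[w]))


if __name__ == "__main__":
    print(cosine_similarity(TOY_VECTORS["coeur"], TOY_VECTORS["poumon"]))
    print(odd_one_out(["coeur", "poumon", "aspirine"], TOY_VECTORS))
    print(analogy("coeur", "cardiologue", "poumon", TOY_VECTORS))
```

With these toy vectors the odd one out is "aspirine" and the analogy returns "pneumologue". In practice the same functions would be applied to the vectors exported from each trained model, so that Word2Vec, GloVe, and FastText are scored on identical word lists.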

RESULTS

Word2Vec had the highest score on 3 of the 4 rated tasks (analogy-based operations, odd one out, and human validation), particularly with the skip-gram architecture.

CONCLUSIONS

Although Word2Vec best preserved semantic properties, each model has its own strengths and weaknesses, such as the very short training time of GloVe or the preservation of morphological similarity observed with FastText. The models and test sets produced by this study will be the first to be made publicly available through a graphical interface, to help advance French biomedical research.
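One plausible reason for the morphological behavior of FastText is that it builds word vectors from character n-grams, so words sharing a stem also share subword units. The sketch below only illustrates that overlap. It assumes n-gram lengths of 3 to 6 (the fastText library defaults; the settings used in the study are not stated here), and the helper names and example words are hypothetical. The real model additionally hashes n-grams into buckets and sums their learned vectors.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams of a word wrapped in '<' and '>' boundary markers."""
    token = f"<{word}>"
    return {
        token[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(token) - n + 1)
    }


def jaccard(a, b):
    """Share of n-grams common to both sets (0.0 = none, 1.0 = identical)."""
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    cardiologie = char_ngrams("cardiologie")
    cardiologue = char_ngrams("cardiologue")
    rein = char_ngrams("rein")
    # Morphologically related words share many subword units ...
    print(jaccard(cardiologie, cardiologue))
    # ... while unrelated words share few or none.
    print(jaccard(cardiologie, rein))
```

Because shared n-grams contribute shared vector components, "cardiologie" and "cardiologue" end up close together even before any context is considered, which a whole-word model such as Word2Vec or GloVe cannot do.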
