Word Embedding for French Natural Language in Healthcare: A Comparative Study.

Author information

Dynomant Emeric, Lelong Romain, Dahamna Badisse, Massonnaud Clément, Kerdelhué Gaëtan, Grosjean Julien, Canu Stéphane, Darmoni Stéfan

Affiliations

OmicX, 72 Rue de la République, 76140, Le Petit Quevilly, Normandie, France.

Department of Biomedical Informatics, Cour Leschevin, CHU de Rouen, 1 Rue de Germont, 76031 Rouen, Normandie, France.

Publication information

Stud Health Technol Inform. 2019 Aug 21;264:118-122. doi: 10.3233/SHTI190195.

DOI:10.3233/SHTI190195

PMID:31437897

Abstract

Structuring raw medical documents with ontology mapping is now the next step for medical intelligence. Deep learning models take mathematically embedded information, such as encoded text, as input. To produce it, word embedding methods represent every word of a text as a fixed-length vector. Three word embedding methods were formally evaluated on raw medical documents. The data comprise more than 12 million diverse documents produced at the Rouen hospital (drug prescriptions, discharge and surgery summaries, inter-service letters, etc.). Automatic and manual validation shows that Word2Vec with the skip-gram architecture achieved the best score on three of the four accuracy tests. This model will now be used as the first layer of an AI-based semantic annotator.
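
As a rough illustration of the skip-gram idea named in the abstract, the sketch below builds the (center word, context word) training pairs that a skip-gram model learns from. It is not the authors' pipeline: the function name `skipgram_pairs`, the `window` parameter, and the sample phrase are all hypothetical, and a real model would go on to fit a fixed-length vector for each word from such pairs, typically with an existing library rather than hand-written code.

```python
from typing import List, Tuple


def skipgram_pairs(tokens: List[str], window: int = 2) -> List[Tuple[str, str]]:
    """Return (center, context) pairs for tokens within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        # Clamp the context window to the bounds of the token list.
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs


# Hypothetical French clinical phrase, split on whitespace (an assumption;
# the abstract does not describe the tokenization that was used).
tokens = "patient hospitalisé pour douleur thoracique".split()
pairs = skipgram_pairs(tokens, window=1)

for center, context in pairs:
    print(center, "->", context)
```

With a window of 1, each word is paired only with its immediate neighbours; widening the window trades syntactic closeness for broader topical context.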
