Optimizing Corpus Creation for Training Word Embedding in Low Resource Domains: A Case Study in Autism Spectrum Disorder (ASD).

Authors

Gu Yang, Leroy Gondy, Pettygrove Sydney, Galindo Maureen Kelly, Kurzius-Spencer Margaret

Affiliation

University of Arizona, Tucson, Arizona.

Publication information

AMIA Annu Symp Proc. 2018 Dec 5;2018:508-517. eCollection 2018.

Abstract

Automating the extraction of behavioral criteria indicative of Autism Spectrum Disorder (ASD) from electronic health records (EHRs) can contribute significantly to efforts to monitor the condition. Word embedding algorithms such as Word2Vec encode the semantic meanings of words as vectors and can assist in automated vocabulary discovery from EHRs. However, the text available for training word embeddings for ASD is minuscule compared to the billions of tokens typically used. We evaluate the importance of corpus specificity versus size and hypothesize that, for specific domains, small corpora can generate excellent word embeddings. We custom-built 6 ASD-themed corpora (N=4482), using ASD EHRs and abstracts from PubMed (N=39K) and PsychInfo (N=69K), and evaluated them. The most useful 200-dimension embeddings were those generated from the small ASD EHR data. Because of the diversity of their vocabulary, the abstract-based embeddings generated fewer related terms and saw minimal improvement when the size of the corpus increased.
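The vocabulary-discovery step mentioned in the abstract can be illustrated with a minimal sketch: given word vectors, rank the terms closest to a seed term by cosine similarity. This is not the authors' pipeline. The terms and the 3-dimensional vectors below are hypothetical toy values chosen for illustration, whereas the study trained 200-dimension embeddings with Word2Vec, and the function names are assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # a zero vector has no direction; treat it as unrelated
    return dot / (norm_u * norm_v)


def most_related(seed, embeddings, top_n=3):
    """Return the top_n (term, similarity) pairs closest to the seed term."""
    seed_vec = embeddings[seed]
    scored = [
        (term, cosine(seed_vec, vec))
        for term, vec in embeddings.items()
        if term != seed
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


# Hypothetical toy embeddings (3 dimensions, made-up values), not study data.
TOY_EMBEDDINGS = {
    "echolalia": [0.9, 0.1, 0.0],
    "repeats": [0.8, 0.2, 0.1],
    "scripting": [0.7, 0.3, 0.0],
    "asthma": [0.0, 0.1, 0.9],
}

if __name__ == "__main__":
    for term, score in most_related("echolalia", TOY_EMBEDDINGS, top_n=2):
        print(f"{term}\t{score:.3f}")
```

In this toy setup, terms whose vectors point in nearly the same direction as the seed rank first. Under the abstract's finding, a domain-specific corpus would be expected to place more clinically related terms near each seed than a larger but more diverse one.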
