Gu Yang, Leroy Gondy, Pettygrove Sydney, Galindo Maureen Kelly, Kurzius-Spencer Margaret
University of Arizona, Tucson, Arizona.
AMIA Annu Symp Proc. 2018 Dec 5;2018:508-517. eCollection 2018.
Automating the extraction of behavioral criteria indicative of Autism Spectrum Disorder (ASD) in electronic health records (EHRs) can contribute significantly to the effort to monitor the condition. Word embedding algorithms such as Word2Vec can encode the semantic meanings of words in vectors and assist in automated vocabulary discovery from EHRs. However, the text available for training word embeddings for ASD is minuscule compared to the billions of tokens typically used. We evaluate the importance of corpus specificity versus size and hypothesize that, for specific domains, small corpora can generate excellent word embeddings. We custom-built 6 ASD-themed corpora (N=4482), using ASD EHRs and abstracts from PubMed (N=39K) and PsychInfo (N=69K), and evaluated them. We were able to generate the most useful 200-dimension embeddings based on the small ASD EHR data. Due to the diversity of their vocabulary, the abstract-based embeddings generated fewer related terms and saw minimal improvement when the size of the corpus increased.
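As a rough illustration of how trained embeddings can support vocabulary discovery, the sketch below ranks terms by cosine similarity to a seed term. It is not the authors' pipeline: the function names (`cosine`, `related_terms`) and the three-dimensional toy vectors are invented for illustration only, whereas the study used 200-dimension Word2Vec embeddings trained on real corpora.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # treat a zero vector as unrelated to everything
    return dot / (norm_u * norm_v)


def related_terms(embeddings, seed, topn=3):
    """Return the topn terms most similar to `seed`, best match first."""
    seed_vec = embeddings[seed]
    scored = [
        (term, cosine(seed_vec, vec))
        for term, vec in embeddings.items()
        if term != seed
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]


# Hypothetical toy vectors, NOT values from the study.
TOY_EMBEDDINGS = {
    "rocking": [0.9, 0.1, 0.0],
    "flapping": [0.8, 0.2, 0.1],
    "spinning": [0.7, 0.3, 0.0],
    "invoice": [0.0, 0.1, 0.9],
}

if __name__ == "__main__":
    for term, score in related_terms(TOY_EMBEDDINGS, "rocking"):
        print(f"{term}\t{score:.3f}")
```

In a setting like the one described, the seed would be a known behavioral term and the highly ranked neighbors would be candidate vocabulary for review; how many of those candidates are truly related is the kind of usefulness the abstract compares across corpora.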