Optimizing Corpus Creation for Training Word Embedding in Low Resource Domains: A Case Study in Autism Spectrum Disorder (ASD).

Authors

Gu Yang, Leroy Gondy, Pettygrove Sydney, Galindo Maureen Kelly, Kurzius-Spencer Margaret

Affiliation

University of Arizona, Tucson, Arizona.

Publication

AMIA Annu Symp Proc. 2018 Dec 5;2018:508-517. eCollection 2018.

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC6371367/

Abstract

Automating the extraction of behavioral criteria indicative of Autism Spectrum Disorder (ASD) in electronic health records (EHRs) can contribute significantly to the effort to monitor the condition. Word embedding algorithms such as Word2Vec can encode the semantic meanings of words in vectors and assist in automated vocabulary discovery from EHRs. However, the text available for training word embeddings for ASD is minuscule compared to the billions of tokens typically used. We evaluate the importance of corpus specificity versus size and hypothesize that for specific domains small corpora can generate excellent word embeddings. We custom-built 6 ASD-themed corpora (N=4482), using ASD EHRs and abstracts from PubMed (N=39K) and PsychInfo (N=69K), and evaluated them. We were able to generate the most useful 200-dimension embeddings based on the small ASD EHR data. Due to the diversity of their vocabulary, the abstract-based embeddings generated fewer related terms and saw minimal improvement when the size of the corpus increased.
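As a rough illustration of how embeddings can support vocabulary discovery, the sketch below ranks candidate terms by cosine similarity to a seed term. It is not the authors' pipeline: the function names `cosine` and `related_terms`, the example terms, and the 4-dimensional toy vectors are all hypothetical, standing in for the 200-dimension vectors a trained Word2Vec model would supply.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # treat a zero vector as unrelated to everything
    return dot / (norm_u * norm_v)

def related_terms(seed, embeddings, top_n=3):
    """Return the top_n (term, similarity) pairs closest to the seed term."""
    seed_vec = embeddings[seed]
    scored = [
        (term, cosine(seed_vec, vec))
        for term, vec in embeddings.items()
        if term != seed
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]

# Hypothetical toy vectors (assumption): real vectors would come from a
# model trained on the domain corpus, not be written by hand.
toy_embeddings = {
    "echolalia": [0.9, 0.1, 0.0, 0.2],
    "repeats":   [0.8, 0.2, 0.1, 0.3],
    "scripting": [0.7, 0.0, 0.1, 0.1],
    "fever":     [0.0, 0.9, 0.8, 0.1],
}

for term, score in related_terms("echolalia", toy_embeddings, top_n=2):
    print(f"{term}\t{score:.3f}")
```

Counting how many of the returned neighbours a domain expert judges relevant gives one simple way to compare embeddings trained on different corpora, which is the kind of comparison the abstract describes.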
