Development of a Consumer Health Vocabulary by Mining Health Forum Texts Based on Word Embedding: Semiautomatic Approach.

Author information

Gu Gen, Zhang Xingting, Zhu Xingeng, Jian Zhe, Chen Ken, Wen Dong, Gao Li, Zhang Shaodian, Wang Fei, Ma Handong, Lei Jianbo

Affiliations

Synyi Research, Shanghai, China.

Center for Medical Informatics, Peking University, Beijing, China.

Publication information

JMIR Med Inform. 2019 May 23;7(2):e12704. doi: 10.2196/12704.

DOI:10.2196/12704

PMID:31124461

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC6552449/

Abstract

BACKGROUND

The vocabulary gap between consumers and professionals in the medical domain hinders information seeking and communication. Consumer health vocabularies have been developed to aid such informatics applications. This purpose is best served if the vocabulary evolves with consumers' language.

OBJECTIVE

Our objective is to develop a method for identifying and adding new terms to consumer health vocabularies, so that they can keep up with constantly evolving medical knowledge and language use.

METHODS

In this paper, we propose a consumer health term-finding framework based on a distributed word vector space model. We first learned word vectors from a large-scale text corpus in an unsupervised manner and then used existing consumer health vocabularies as supervision to fine-tune these vector representations. In the fine-tuned word vector space, we identified pairs of professional terms and their consumer variants by their semantic distance. The extracted and labeled pairs of entities were then manually reviewed to validate the results generated by the proposed approach. The results were evaluated using mean reciprocal rank (MRR).
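The ranking-and-evaluation step described above can be illustrated with a minimal sketch. It assumes cosine similarity as the semantic distance and uses a hand-made toy vector space; the terms, vectors, and function names (`cosine`, `rank_of`, `mean_reciprocal_rank`) are hypothetical, and the embedding training and supervised fine-tuning from the paper are not reproduced here.

```python
import math

# Hypothetical 3-dimensional toy vectors; real embeddings would be learned
# from a large corpus and then fine-tuned, which this sketch does not do.
TOY_VECTORS = {
    "myocardial_infarction": [1.0, 0.0, 0.0],
    "heart_attack": [0.9, 0.1, 0.0],
    "hypertension": [0.0, 1.0, 0.0],
    "high_blood_pressure": [0.1, 0.8, 0.3],
    "headache": [0.3, 0.3, 0.9],
    "cephalalgia": [0.1, 0.8, 0.5],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_of(query, target, vectors):
    """1-based position of `target` among all other terms, sorted by
    decreasing cosine similarity to `query`."""
    candidates = [term for term in vectors if term != query]
    candidates.sort(key=lambda term: cosine(vectors[query], vectors[term]),
                    reverse=True)
    return candidates.index(target) + 1


def mean_reciprocal_rank(pairs, vectors):
    """Average of 1/rank over (query, counterpart) pairs."""
    return sum(1.0 / rank_of(q, t, vectors) for q, t in pairs) / len(pairs)


# Professional term -> consumer variant pairs (illustrative only).
pairs = [
    ("myocardial_infarction", "heart_attack"),
    ("hypertension", "high_blood_pressure"),
    ("cephalalgia", "headache"),
]
mrr = mean_reciprocal_rank(pairs, TOY_VECTORS)
print(f"MRR on toy pairs: {mrr:.3f}")
```

In this toy space the third pair is deliberately ranked imperfectly, so the MRR falls below 1; a human reviewer would then inspect the top-ranked candidates for each query, which mirrors the semiautomatic workflow the abstract describes.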

RESULTS

Manual evaluation showed that it is feasible to identify alternative medical concepts by using professional or consumer concepts as queries in the word vector space without fine tuning, but the results are more promising in the final fine-tuned word vector space. The MRR values indicated that, on average, a professional or consumer concept is about the 14th closest to its counterpart in the word vector space without fine tuning and about the 8th closest in the final fine-tuned word vector space. Furthermore, the results demonstrate that our method can collect abbreviations and common typos frequently used by consumers.
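For reference, the standard definition of mean reciprocal rank is given below, where $Q$ is the set of query terms and $\mathrm{rank}_i$ is the position of the correct counterpart in the ranked list for query $i$ (these symbols are introduced here for illustration and do not appear in the abstract):

```latex
\mathrm{MRR} = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{\mathrm{rank}_i}
```

Under this definition the MRR lies between 0 and 1, so the figures of 14 and 8 are most plausibly read as approximate reciprocals of the MRR (that is, MRR values near $1/14 \approx 0.07$ and $1/8 = 0.125$). This reading is an assumption; the abstract does not report the underlying MRR values.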

CONCLUSIONS

By integrating a large amount of text information and existing consumer health vocabularies, our method outperformed several baseline ranking methods and is effective for generating a list of candidate terms for human review during consumer health vocabulary development.
