Development of a Consumer Health Vocabulary by Mining Health Forum Texts Based on Word Embedding: Semiautomatic Approach.

Author information

Gu Gen, Zhang Xingting, Zhu Xingeng, Jian Zhe, Chen Ken, Wen Dong, Gao Li, Zhang Shaodian, Wang Fei, Ma Handong, Lei Jianbo

Affiliations

Synyi Research, Shanghai, China.

Center for Medical Informatics, Peking University, Beijing, China.

Publication information

JMIR Med Inform. 2019 May 23;7(2):e12704. doi: 10.2196/12704.

Abstract

BACKGROUND

The vocabulary gap between consumers and professionals in the medical domain hinders information seeking and communication. Consumer health vocabularies have been developed to aid such informatics applications. This purpose is best served if the vocabulary evolves with consumers' language.

OBJECTIVE

Our objective is to develop a method for identifying new terms and adding them to consumer health vocabularies, so that the vocabularies can keep up with constantly evolving medical knowledge and language use.

METHODS

We propose a framework for finding consumer health terms based on a distributed word vector space model. We first learned word vectors from a large-scale text corpus by unsupervised word embedding, and then used existing consumer health vocabularies as supervision to fine-tune those vectors. In the fine-tuned word vector space, we identified pairs of professional terms and their consumer variants by their semantic distance. The extracted and labeled pairs of entities were then reviewed manually to validate the results generated by the proposed approach. The results were evaluated using mean reciprocal rank (MRR).
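The retrieval and evaluation steps described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes cosine similarity as the semantic distance, and the function names (`rank_neighbors`, `mean_reciprocal_rank`), the toy vocabulary, and the 3-dimensional vectors are all made up for the example.

```python
import numpy as np


def rank_neighbors(query, vocab, vectors):
    """Return all other vocabulary terms, most similar to `query` first.

    Similarity is cosine similarity (an assumption of this sketch).
    """
    # Normalize rows so that a dot product equals cosine similarity.
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = unit @ unit[vocab.index(query)]
    order = np.argsort(-sims)  # descending similarity
    return [vocab[i] for i in order if vocab[i] != query]


def mean_reciprocal_rank(pairs, vocab, vectors):
    """Average of 1/rank of each target term in its query's neighbor list."""
    reciprocal_ranks = []
    for query, target in pairs:
        ranked = rank_neighbors(query, vocab, vectors)
        reciprocal_ranks.append(1.0 / (ranked.index(target) + 1))
    return float(np.mean(reciprocal_ranks))


# Hypothetical toy embedding space: professional terms and consumer variants.
vocab = [
    "myocardial_infarction",
    "heart_attack",
    "hypertension",
    "high_blood_pressure",
    "headache",
]
vectors = np.array([
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.1],
    [0.1, 0.9, 0.0],
    [0.2, 0.8, 0.1],
    [0.0, 0.1, 0.9],
])

# Known (professional, consumer) pairs used as the evaluation set.
pairs = [
    ("myocardial_infarction", "heart_attack"),
    ("hypertension", "high_blood_pressure"),
]

candidates = rank_neighbors("myocardial_infarction", vocab, vectors)
mrr = mean_reciprocal_rank(pairs, vocab, vectors)
print(candidates[:3], mrr)
```

In a semiautomatic workflow of this kind, the top of each ranked list would serve as the candidate list handed to a human reviewer, and the MRR over known pairs gives a single number for comparing vector spaces before and after fine-tuning.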

RESULTS

Manual evaluation showed that it is feasible to identify alternative medical concepts by using professional or consumer concepts as queries in the word vector space without fine-tuning, but the results are more promising in the final fine-tuned word vector space. The MRR values indicated that, on average, a professional or consumer concept is about the 14th closest to its counterpart in the word vector space without fine-tuning and about the 8th closest in the final fine-tuned word vector space. Furthermore, the results demonstrate that our method can collect abbreviations and common typos frequently used by consumers.
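The rank-style reading of MRR follows from its definition. As a reminder, using standard notation that is not taken from the abstract ($Q$ is the set of query concepts and $\mathrm{rank}_i$ is the position of the correct counterpart in the neighbor list of query $i$):

```latex
\mathrm{MRR} = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{\mathrm{rank}_i},
\qquad
\frac{1}{\mathrm{MRR}} = \text{harmonic mean of } \mathrm{rank}_1, \dots, \mathrm{rank}_{|Q|}
```

MRR itself always lies between 0 and 1, so a statement such as "about 14th closest" is a statement about its reciprocal. As plain arithmetic, a typical rank of 14 corresponds to an MRR of roughly $1/14 \approx 0.071$ and a typical rank of 8 to $1/8 = 0.125$. These are illustrative conversions, not values reported in the abstract.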

CONCLUSIONS

By integrating a large amount of text information and existing consumer health vocabularies, our method outperformed several baseline ranking methods and is effective for generating a list of candidate terms for human review during consumer health vocabulary development.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/a2cd/6552449/d4fec73bece4/medinform_v7i2e12704_fig1.jpg
