Domain specific word embeddings for natural language processing in radiology.

Author information

Chen Timothy L, Emerling Max, Chaudhari Gunvant R, Chillakuru Yeshwant R, Seo Youngho, Vu Thienkhai H, Sohn Jae Ho

Affiliations

University of California San Francisco (UCSF), Radiology and Biomedical Imaging, 505 Parnassus Ave, San Francisco, CA 94143, USA; University of Illinois College of Medicine, 1853 W Polk St, Chicago, IL 60612, USA.

University of California San Francisco (UCSF), Radiology and Biomedical Imaging, 505 Parnassus Ave, San Francisco, CA 94143, USA; University of California Berkeley, 2626 Hearst Ave, Berkeley, CA 94720, USA.

Publication information

J Biomed Inform. 2021 Jan;113:103665. doi: 10.1016/j.jbi.2020.103665. Epub 2020 Dec 15.

Abstract

BACKGROUND

Interest in machine learning-based natural language processing (NLP) methods in radiology has been increasing; however, models have often used word embeddings trained on general web corpora because no radiology-specific corpus has been available.

PURPOSE

We examined the potential of Radiopaedia to serve as a general radiology corpus for producing radiology-specific word embeddings that could enhance performance on an NLP task on radiological text.

MATERIALS AND METHODS

Embeddings of dimension 50, 100, 200, and 300 were trained on articles collected from Radiopaedia using the GloVe algorithm and evaluated on analogy completion. A shallow neural network using input from either our trained embeddings or pre-trained Wikipedia 2014 + Gigaword 5 (WG) embeddings was used to label the Radiopaedia articles. Labeling performance was evaluated based on exact match accuracy and Hamming loss. McNemar's test with continuity correction and the Benjamini-Hochberg correction, and a 5×2 cross-validation paired two-tailed t-test, were used to assess statistical significance.
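To make the evaluation measures named above concrete, here is a minimal sketch of analogy completion by vector arithmetic and of the two multi-label metrics (exact match accuracy and Hamming loss). The abstract does not give the authors' implementation, so the function names, the cosine-similarity scoring rule, the toy 3-dimensional vectors, and the label matrices below are all illustrative assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


def complete_analogy(emb, a, b, c):
    """Return the word d that best completes 'a is to b as c is to d'.

    Uses the common offset rule: pick the word closest to b - a + c,
    excluding the three query words themselves.
    """
    target = [vb - va + vc for va, vb, vc in zip(emb[a], emb[b], emb[c])]
    candidates = [w for w in emb if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(emb[w], target))


def exact_match_accuracy(y_true, y_pred):
    """Fraction of samples whose whole predicted label set is correct."""
    matches = sum(list(t) == list(p) for t, p in zip(y_true, y_pred))
    return matches / len(y_true)


def hamming_loss(y_true, y_pred):
    """Fraction of individual label slots that are predicted wrongly."""
    wrong = sum(ti != pi for t, p in zip(y_true, y_pred) for ti, pi in zip(t, p))
    total = sum(len(t) for t in y_true)
    return wrong / total


# Hypothetical toy embeddings (not taken from the paper).
toy_emb = {
    "kidney": [1.0, 0.0, 0.0],
    "renal": [1.0, 0.0, 1.0],
    "liver": [0.0, 1.0, 0.0],
    "hepatic": [0.0, 1.0, 1.0],
    "bone": [1.0, 1.0, 0.0],
}
analogy_answer = complete_analogy(toy_emb, "kidney", "renal", "liver")

# Hypothetical multi-label ground truth and predictions (4 samples, 3 labels).
y_true = [[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]]
y_pred = [[1, 0, 1], [0, 1, 1], [1, 1, 0], [0, 0, 0]]
acc = exact_match_accuracy(y_true, y_pred)
loss = hamming_loss(y_true, y_pred)
```

Exact match accuracy is the stricter of the two metrics: a sample counts only if every label is right, whereas Hamming loss gives partial credit by counting individual label errors.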

RESULTS

For accuracy in the analogy task, 50-dimensional (50-D) Radiopaedia embeddings outperformed WG embeddings on tumor origin analogies (p < 0.05) and organ adjectives (p < 0.01), whereas WG embeddings tended to outperform on inflammation location and bone vs. muscle analogies (p < 0.01). The two embeddings had comparable performance on other subcategories. In the labeling task, the Radiopaedia-based model outperformed the WG-based model at 50, 100, 200, and 300-D for exact match accuracy (p < 0.001, p < 0.001, p < 0.01, and p < 0.05, respectively) and Hamming loss (p < 0.001, p < 0.001, p < 0.01, and p < 0.05, respectively).
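For readers unfamiliar with how such p-values are obtained, the following is a minimal sketch of McNemar's test with continuity correction followed by a Benjamini-Hochberg adjustment across several comparisons. The abstract does not state which comparison each test was applied to, and the discordant counts and raw p-values below are hypothetical, not the study's data. The function names are likewise illustrative.

```python
import math


def mcnemar_continuity(b, c):
    """McNemar's test with continuity correction for paired outcomes.

    b: items model A got right and model B got wrong
    c: items model A got wrong and model B got right
    Returns (chi-square statistic, two-sided p-value with 1 degree of freedom).
    """
    if b + c == 0:
        return 0.0, 1.0  # no discordant pairs, so no evidence of a difference
    stat = (abs(b - c) - 1) ** 2 / (b + c)
    # Survival function of chi-square with 1 df: P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


def benjamini_hochberg(pvals):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical discordant counts for one pair of models.
stat, p = mcnemar_continuity(b=10, c=2)

# Hypothetical raw p-values from four comparisons (e.g. four dimensions).
raw_p = [0.01, 0.04, 0.03, 0.005]
adj_p = benjamini_hochberg(raw_p)
```

McNemar's test uses only the discordant pairs, which is why it suits comparing two models on the same items; the Benjamini-Hochberg step then controls the false discovery rate when several such comparisons are made.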

CONCLUSION

We have developed a set of word embeddings from Radiopaedia and shown that they can preserve relevant medical semantics and augment performance on a radiology NLP task. Our results suggest that the cultivation of a radiology-specific corpus can benefit radiology NLP models in the future.
