Domain specific word embeddings for natural language processing in radiology.

Authors

Chen Timothy L, Emerling Max, Chaudhari Gunvant R, Chillakuru Yeshwant R, Seo Youngho, Vu Thienkhai H, Sohn Jae Ho

Affiliations

University of California San Francisco (UCSF), Radiology and Biomedical Imaging, 505 Parnassus Ave, San Francisco, CA 94143, USA; University of Illinois College of Medicine, 1853 W Polk St, Chicago, IL 60612, USA.

University of California San Francisco (UCSF), Radiology and Biomedical Imaging, 505 Parnassus Ave, San Francisco, CA 94143, USA; University of California Berkeley, 2626 Hearst Ave, Berkeley, CA 94720, USA.

Publication information

J Biomed Inform. 2021 Jan;113:103665. doi: 10.1016/j.jbi.2020.103665. Epub 2020 Dec 15.

DOI: 10.1016/j.jbi.2020.103665

PMID: 33333323

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC7856086/

Abstract

BACKGROUND

There has been increasing interest in machine learning-based natural language processing (NLP) methods in radiology; however, models have often used word embeddings trained on general web corpora due to the lack of a radiology-specific corpus.

PURPOSE

We examined the potential of Radiopaedia to serve as a general radiology corpus for producing radiology-specific word embeddings that could be used to enhance performance on an NLP task on radiological text.

MATERIALS AND METHODS

Embeddings of dimension 50, 100, 200, and 300 were trained on articles collected from Radiopaedia using the GloVe algorithm and evaluated on analogy completion. A shallow neural network using input from either our trained embeddings or pre-trained Wikipedia 2014 + Gigaword 5 (WG) embeddings was used to label the Radiopaedia articles. Labeling performance was evaluated based on exact match accuracy and Hamming loss. McNemar's test with continuity correction and the Benjamini-Hochberg correction, together with a 5×2 cross-validation paired two-tailed t-test, were used to assess statistical significance.
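As a rough illustration of the two evaluations described above, the Python sketch below completes an analogy with the common vector-offset rule (choose the word whose embedding is most cosine-similar to b - a + c) and computes exact match accuracy and Hamming loss for multi-label predictions. The abstract does not state which analogy scoring rule or label set the authors used, so the offset rule, the 3-dimensional `toy_embeddings`, and the label vectors are assumptions invented for illustration, not the study's data or code.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def complete_analogy(emb, a, b, c):
    """Answer 'a is to b as c is to ?' with the vector-offset rule.

    The three query words are excluded from the candidates.
    """
    target = [vb - va + vc for va, vb, vc in zip(emb[a], emb[b], emb[c])]
    candidates = [w for w in emb if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(emb[w], target))

def exact_match_accuracy(y_true, y_pred):
    """Fraction of samples whose whole label vector is predicted exactly."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")
    matches = sum(1 for t, p in zip(y_true, y_pred) if list(t) == list(p))
    return matches / len(y_true)

def hamming_loss(y_true, y_pred):
    """Fraction of individual label slots that are predicted incorrectly."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")
    wrong = 0
    total = 0
    for t, p in zip(y_true, y_pred):
        if len(t) != len(p):
            raise ValueError("label vectors must have the same length")
        wrong += sum(1 for a, b in zip(t, p) if a != b)
        total += len(t)
    return wrong / total

# Hypothetical toy embeddings: two "organ" axes plus one "adjective" axis.
toy_embeddings = {
    "kidney": [1.0, 0.0, 0.0],
    "renal": [1.0, 0.0, 1.0],
    "liver": [0.0, 1.0, 0.0],
    "hepatic": [0.0, 1.0, 1.0],
    "lung": [0.5, 0.5, 0.0],
}

# Hypothetical multi-label ground truth and predictions (3 labels, 4 articles).
toy_true = [[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]]
toy_pred = [[1, 0, 1], [0, 1, 1], [1, 1, 0], [1, 0, 0]]

print(complete_analogy(toy_embeddings, "kidney", "renal", "liver"))
print(exact_match_accuracy(toy_true, toy_pred))
print(hamming_loss(toy_true, toy_pred))
```

With these toy inputs the analogy resolves to `hepatic`, exact match accuracy is 0.5 (two of four articles fully correct), and Hamming loss is 0.25 (three of twelve label slots wrong). The two labeling metrics differ in strictness: exact match gives no credit for a partly correct label set, while Hamming loss counts each label slot separately.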

RESULTS

For accuracy in the analogy task, 50-dimensional (50-D) Radiopaedia embeddings outperformed WG embeddings on tumor origin analogies (p < 0.05) and organ adjectives (p < 0.01), whereas WG embeddings tended to perform better on inflammation location and bone vs. muscle analogies (p < 0.01). The two embeddings had comparable performance on other subcategories. In the labeling task, the Radiopaedia-based model outperformed the WG-based model at 50-, 100-, 200-, and 300-D for exact match accuracy (p < 0.001, p < 0.001, p < 0.01, and p < 0.05, respectively) and Hamming loss (p < 0.001, p < 0.001, p < 0.01, and p < 0.05, respectively).
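The significance procedures named in the methods can be sketched in a few lines. The Python example below computes McNemar's test with continuity correction from the discordant counts of two models evaluated on the same items, then adjusts a list of p-values with the Benjamini-Hochberg procedure. The counts and p-values are hypothetical placeholders, and the abstract does not say which comparison each test was applied to, so this illustrates the procedures rather than reproducing the reported analysis.

```python
import math

def mcnemar_continuity(b, c):
    """McNemar's chi-square test with continuity correction.

    b: items model A got right and model B got wrong.
    c: items model A got wrong and model B got right.
    Returns (statistic, p_value) using a chi-square distribution with 1 df.
    """
    if b < 0 or c < 0:
        raise ValueError("discordant counts must be non-negative")
    n = b + c
    if n == 0:
        return 0.0, 1.0  # no disagreements, so no evidence of a difference
    # Clamped at zero so that tied counts (b == c) give a statistic of 0.
    statistic = max(abs(b - c) - 1, 0) ** 2 / n
    # Survival function of chi-square with 1 df: erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value

def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical discordant counts for one analogy subcategory.
stat, p = mcnemar_continuity(15, 5)
print(stat, p)

# Hypothetical raw p-values from several subcategory comparisons.
raw_pvals = [0.01, 0.04, 0.03, 0.005]
print(benjamini_hochberg(raw_pvals))
```

McNemar's test uses only the items on which the two models disagree, which suits a paired comparison on a shared test set. The Benjamini-Hochberg step controls the false discovery rate when several such comparisons are made at once.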

CONCLUSION

We have developed a set of word embeddings from Radiopaedia and shown that they can preserve relevant medical semantics and augment performance on a radiology NLP task. Our results suggest that the cultivation of a radiology-specific corpus can benefit radiology NLP models in the future.
