Agarwal Siddharth, Wood David, Murray Benjamin A K, Wei Yiran, Al Busaidi Ayisha, Kafiabadi Sina, Guilhem Emily, Lynch Jeremy, Townend Matthew, Mazumder Asif, Barker Gareth J, Cole James H, Sasieni Peter, Ourselin Sebastien, Modat Marc, Booth Thomas C
School of Biomedical Engineering & Imaging Sciences, King's College London, Becket House, London, UK.
Department of Neuroradiology, Ruskin Wing, King's College Hospital NHS Foundation Trust, London, UK.
Eur Radiol. 2025 Mar 17. doi: 10.1007/s00330-025-11500-9.
To determine the effectiveness of hospital-specific domain adaptation through masked language modelling (MLM) on the performance of BERT-based models in classifying neuroradiology reports, and to compare these models with open-source large language models (LLMs).
This retrospective study (2008-2019) utilised 126,556 and 86,032 MRI brain reports from two tertiary hospitals: King's College Hospital (KCH) and Guy's and St Thomas' NHS Foundation Trust (GSTT), respectively. Various BERT-based models, including RoBERTa, BioBERT and RadBERT, underwent MLM on unlabelled reports from these centres. The downstream tasks were binary abnormality classification and multi-label classification. Models with and without hospital-specific domain adaptation were compared against each other and against LLMs on internal (KCH) and external (GSTT) hold-out test sets. Model performances for binary classification were compared using two-way and one-way ANOVA.
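For orientation, a minimal sketch of the hospital-specific MLM adaptation step is shown below, assuming the HuggingFace transformers and datasets libraries; the checkpoint name, file path and hyperparameters are illustrative, not the paper's exact configuration.

```python
# Minimal sketch of hospital-specific domain adaptation via masked language
# modelling (MLM). Assumes HuggingFace transformers/datasets; the checkpoint,
# file path and hyperparameters are illustrative, not the paper's setup.
from datasets import load_dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model_name = "roberta-base"  # the paper also adapts BioBERT and RadBERT
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# Unlabelled MRI brain reports, one report per line (hypothetical file).
reports = load_dataset("text", data_files={"train": "unlabelled_reports.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = reports.map(tokenize, batched=True, remove_columns=["text"])

# Dynamic masking of 15% of tokens: the standard BERT/RoBERTa MLM objective.
collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="mlm-adapted",
                           num_train_epochs=3,
                           per_device_train_batch_size=16),
    train_dataset=tokenized["train"],
    data_collator=collator,
)
trainer.train()  # the adapted encoder is then fine-tuned for classification
```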
All models that underwent hospital-specific domain adaptation outperformed their baseline counterparts (all p-values < 0.001). For binary classification, MLM on all available unlabelled reports (194,467 reports) yielded the highest balanced accuracies (mean ± standard deviation; KCH: 97.0 ± 0.4%, GSTT: 95.5 ± 1.0%), after which no differences between BERT-based models remained (one-way ANOVA, p-values > 0.05). There was a log-linear relationship between the number of reports and performance. Llama-3.0 70B was the best-performing LLM (KCH: 97.1%, GSTT: 94.0%). Multi-label classification demonstrated consistent performance improvements from MLM across all abnormality categories.
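The two statistical observations above (log-linear scaling and the one-way ANOVA across fully adapted models) can be illustrated with the sketch below; all numbers are hypothetical, since the abstract does not report per-run accuracies.

```python
# Illustrative sketch of two analyses reported above; numbers are hypothetical.
import numpy as np
from scipy import stats

# (1) Log-linear scaling: balanced accuracy ~ a + b * log10(n_reports).
n_reports = np.array([10_000, 50_000, 100_000, 194_467])
balanced_acc = np.array([0.92, 0.945, 0.96, 0.97])  # hypothetical points
b, a = np.polyfit(np.log10(n_reports), balanced_acc, deg=1)
print(f"accuracy ~ {a:.3f} + {b:.3f} * log10(n_reports)")

# (2) One-way ANOVA: do fully adapted BERT-based models still differ?
# Each list holds balanced accuracies over repeated runs (hypothetical).
roberta = [0.971, 0.968, 0.973]
biobert = [0.969, 0.972, 0.967]
radbert = [0.970, 0.966, 0.974]
f_stat, p_value = stats.f_oneway(roberta, biobert, radbert)
print(f"F = {f_stat:.2f}, p = {p_value:.3f}")  # p > 0.05: no detectable difference
```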
Hospital-specific domain adaptation should be considered best practice when deploying BERT-based models in new clinical settings. When labelled data is scarce or unavailable, LLMs are a viable alternative, provided adequate computational resources are available.
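As a hedged illustration of the LLM alternative, the sketch below classifies a single report zero-shot with an open-source instruction-tuned model via the transformers text-generation pipeline (recent versions accept chat-style messages); the checkpoint id, prompt and label scheme are assumptions, not the paper's protocol.

```python
# Hedged sketch of zero-shot report classification with an open-source LLM.
# Assumes a recent transformers version and the meta-llama/Meta-Llama-3-70B-
# Instruct checkpoint; prompt and labels are illustrative, not the paper's.
from transformers import pipeline

generator = pipeline("text-generation",
                     model="meta-llama/Meta-Llama-3-70B-Instruct",
                     device_map="auto")  # 70B weights need multi-GPU memory

report = "MRI brain: no acute infarct, haemorrhage or mass lesion."
messages = [
    {"role": "system",
     "content": "You classify neuroradiology reports. Answer with exactly "
                "one word: normal or abnormal."},
    {"role": "user", "content": report},
]
out = generator(messages, max_new_tokens=5)
print(out[0]["generated_text"][-1]["content"])  # expected: "normal"
```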
Question: BERT-based models can classify radiology reports, but it is unclear whether additional hospital-specific domain adaptation offers any incremental benefit.
Findings: Hospital-specific domain adaptation yielded the highest BERT-based model accuracies, and performance scaled log-linearly with the number of reports.
Clinical relevance: BERT-based models with hospital-specific domain adaptation achieve the best classification results, provided sufficient high-quality training labels are available. When labelled data is scarce, LLMs such as Llama-3.0 70B are a viable alternative, provided sufficient computational resources are available.