Agarwal Siddharth, Wood David, Murray Benjamin A K, Wei Yiran, Al Busaidi Ayisha, Kafiabadi Sina, Guilhem Emily, Lynch Jeremy, Townend Matthew, Mazumder Asif, Barker Gareth J, Cole James H, Sasieni Peter, Ourselin Sebastien, Modat Marc, Booth Thomas C
School of Biomedical Engineering & Imaging Sciences, King's College London, Becket House, London, UK.
Department of Neuroradiology, Ruskin Wing, King's College Hospital NHS Foundation Trust, London, UK.
Eur Radiol. 2025 Mar 17. doi: 10.1007/s00330-025-11500-9.
To determine the effectiveness of hospital-specific domain adaptation through masked language modelling (MLM) on the performance of BERT-based models in classifying neuroradiology reports, and to compare these models with open-source large language models (LLMs).
This retrospective study (2008-2019) utilised 126,556 and 86,032 MRI brain reports from two tertiary hospitals: King's College Hospital (KCH) and Guy's and St Thomas' NHS Foundation Trust (GSTT), respectively. Various BERT-based models, including RoBERTa, BioBERT and RadBERT, underwent MLM on unlabelled reports from these centres. The downstream tasks were binary abnormality classification and multi-label classification. Models with and without hospital-specific domain adaptation were compared against each other and against LLMs on internal (KCH) and external (GSTT) hold-out test sets. Model performances for binary classification were compared using two-way and one-way ANOVA.
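For orientation, a minimal sketch of the hospital-specific MLM adaptation step is shown below, assuming the HuggingFace transformers and datasets libraries; the checkpoint name, file path and hyperparameters are illustrative, not the paper's exact configuration.

```python
# Minimal sketch of hospital-specific domain adaptation via masked language
# modelling (MLM). Assumes HuggingFace transformers/datasets; the checkpoint,
# file path and hyperparameters are illustrative, not the paper's setup.
from datasets import load_dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model_name = "roberta-base"  # the paper also adapts BioBERT and RadBERT
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# Unlabelled MRI brain reports, one report per line (hypothetical file).
reports = load_dataset("text", data_files={"train": "unlabelled_reports.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = reports.map(tokenize, batched=True, remove_columns=["text"])

# Dynamic masking of 15% of tokens: the standard BERT/RoBERTa MLM objective.
collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="mlm-adapted",
                           num_train_epochs=3,
                           per_device_train_batch_size=16),
    train_dataset=tokenized["train"],
    data_collator=collator,
)
trainer.train()  # the adapted encoder is then fine-tuned for classification
```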
All models that underwent hospital-specific domain adaptation outperformed their baseline counterparts (all p-values < 0.001). For binary classification, MLM on all available unlabelled reports (194,467 reports) yielded the highest balanced accuracies (mean ± standard deviation; KCH: 97.0 ± 0.4%, GSTT: 95.5 ± 1.0%), after which no differences between BERT-based models remained (one-way ANOVA, p-values > 0.05). There was a log-linear relationship between the number of reports and performance. Llama-3.0 70B was the best-performing LLM (KCH: 97.1%, GSTT: 94.0%). Multi-label classification demonstrated consistent performance improvements from MLM across all abnormality categories.
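The two statistical observations above (log-linear scaling and the one-way ANOVA across fully adapted models) can be illustrated with the sketch below; all numbers are hypothetical, since the abstract does not report per-run accuracies.

```python
# Illustrative sketch of two analyses reported above; numbers are hypothetical.
import numpy as np
from scipy import stats

# (1) Log-linear scaling: balanced accuracy ~ a + b * log10(n_reports).
n_reports = np.array([10_000, 50_000, 100_000, 194_467])
balanced_acc = np.array([0.92, 0.945, 0.96, 0.97])  # hypothetical points
b, a = np.polyfit(np.log10(n_reports), balanced_acc, deg=1)
print(f"accuracy ~ {a:.3f} + {b:.3f} * log10(n_reports)")

# (2) One-way ANOVA: do fully adapted BERT-based models still differ?
# Each list holds balanced accuracies over repeated runs (hypothetical).
roberta = [0.971, 0.968, 0.973]
biobert = [0.969, 0.972, 0.967]
radbert = [0.970, 0.966, 0.974]
f_stat, p_value = stats.f_oneway(roberta, biobert, radbert)
print(f"F = {f_stat:.2f}, p = {p_value:.3f}")  # p > 0.05: no detectable difference
```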
Hospital-specific domain adaptation should be considered best practice when deploying BERT-based models in new clinical settings. When labelled data is scarce or unavailable, LLMs are a viable alternative, provided adequate computational resources are available.
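As a hedged illustration of the LLM alternative, the sketch below classifies a single report zero-shot with an open-source instruction-tuned model via the transformers text-generation pipeline (recent versions accept chat-style messages); the checkpoint id, prompt and label scheme are assumptions, not the paper's protocol.

```python
# Hedged sketch of zero-shot report classification with an open-source LLM.
# Assumes a recent transformers version and the meta-llama/Meta-Llama-3-70B-
# Instruct checkpoint; prompt and labels are illustrative, not the paper's.
from transformers import pipeline

generator = pipeline("text-generation",
                     model="meta-llama/Meta-Llama-3-70B-Instruct",
                     device_map="auto")  # 70B weights need multi-GPU memory

report = "MRI brain: no acute infarct, haemorrhage or mass lesion."
messages = [
    {"role": "system",
     "content": "You classify neuroradiology reports. Answer with exactly "
                "one word: normal or abnormal."},
    {"role": "user", "content": report},
]
out = generator(messages, max_new_tokens=5)
print(out[0]["generated_text"][-1]["content"])  # expected: "normal"
```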
Question: BERT-based models can classify radiology reports, but it is unclear whether additional hospital-specific domain adaptation offers any incremental benefit.
Findings: Hospital-specific domain adaptation yielded the highest BERT-based model accuracies, and performance scaled log-linearly with the number of reports.
Clinical relevance: BERT-based models with hospital-specific domain adaptation achieve the best classification results, provided sufficient high-quality training labels are available. When labelled data is scarce, LLMs such as Llama-3.0 70B are a viable alternative, provided sufficient computational resources are available.