Arzideh Kamyar, Schäfer Henning, Allende-Cid Héctor, Baldini Giulia, Hilser Thomas, Idrissi-Yaghir Ahmad, Laue Katharina, Chakraborty Nilesh, Doll Niclas, Antweiler Dario, Klug Katrin, Beck Niklas, Giesselbach Sven, Friedrich Christoph M, Nensa Felix, Schuler Martin, Hosch René
Institute for Artificial Intelligence in Medicine, University Hospital Essen, Essen, Germany; Central IT Department, Data Integration Center, University Hospital Essen, Essen, Germany.
Institute for Transfusion Medicine, University Hospital Essen, Essen, Germany; Department of Computer Science, University of Applied Sciences and Arts Dortmund, Dortmund, Germany.
Comput Biol Med. 2025 Sep;195:110665. doi: 10.1016/j.compbiomed.2025.110665. Epub 2025 Jun 24.
Extracting clinical entities from unstructured medical documents is critical for improving clinical decision support and documentation workflows. This study evaluates the performance of various encoder and decoder models trained for Named Entity Recognition (NER) of clinical parameters in pathology and radiology reports, with particular attention to the applicability of Large Language Models (LLMs) to this task.
Three NER methods were evaluated: (1) flat NER using transformer-based models, (2) nested NER with a multi-task learning setup, and (3) instruction-based NER using LLMs. A dataset of 2,013 pathology reports and 413 radiology reports, annotated by medical students, was used for training and testing.
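For orientation, the first approach reduces to standard transformer token classification. Below is a minimal sketch, assuming the Hugging Face transformers library; the checkpoint name and report text are illustrative assumptions, not the models or data used in the study (a biomedical checkpoint would match the paper's setting more closely).

```python
# Minimal sketch of flat NER via transformer token classification.
# Checkpoint and example text are illustrative assumptions only.
from transformers import pipeline

ner = pipeline(
    "token-classification",
    model="dslim/bert-base-NER",      # any token-classification checkpoint works here
    aggregation_strategy="simple",    # merge sub-word tokens into entity spans
)

report = "Invasive ductal carcinoma, grade 2, ER positive, PR negative."
for entity in ner(report):
    print(entity["entity_group"], entity["word"], round(entity["score"], 3))
```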
The encoder-based NER models (flat and nested) outperformed the LLM-based approaches. The best-performing flat NER models achieved F1-scores of 0.87-0.88 on pathology reports and up to 0.78 on radiology reports, while nested NER models performed slightly lower. In contrast, multiple LLMs achieved high precision but markedly lower F1-scores (0.18 to 0.30) because of poor recall. A contributing factor appears to be that these LLMs produce fewer but more accurate entity predictions, suggesting they become overly conservative when generating outputs.
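For intuition, the F1-score is the harmonic mean of precision and recall, so high precision cannot offset poor recall. The precision and recall values below are illustrative assumptions; only the F1 range is reported in the study:

\[
F_1 = \frac{2PR}{P + R}, \qquad P = 0.90,\ R = 0.15 \;\Rightarrow\; F_1 = \frac{2 \cdot 0.90 \cdot 0.15}{0.90 + 0.15} \approx 0.26
\]

Even with precision near 0.9, a recall of 0.15 pins the F1-score inside the reported 0.18-0.30 band.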
In their current form, LLMs are unsuitable for comprehensive entity extraction tasks in clinical domains, particularly when a document contains a high number of entity types, although instructing them to return more entities in subsequent refinement turns may improve recall (a sketch of such a refinement follows below). Additionally, their computational overhead does not yield proportional performance gains. Encoder-based NER models, particularly those pre-trained on biomedical data, remain the preferred choice for extracting information from unstructured medical documents.
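The refinement idea mentioned above could look like the following. This is a hedged sketch assuming the OpenAI Python SDK and a generic chat model; the study's actual prompts, models, and entity schema are not specified here.

```python
# Hedged sketch of instruction-based NER with a recall-oriented
# refinement turn. Model name, prompts, and client are assumptions,
# not the study's setup.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

report = "Invasive ductal carcinoma, grade 2, ER positive, PR negative."
messages = [
    {"role": "system",
     "content": 'Extract all clinical entities as a JSON list of '
                '{"type": ..., "text": ...} objects.'},
    {"role": "user", "content": report},
]
first = client.chat.completions.create(model="gpt-4o-mini", messages=messages)

# Refinement turn: explicitly push the model toward higher recall.
messages += [
    {"role": "assistant", "content": first.choices[0].message.content},
    {"role": "user",
     "content": "You likely missed some entities. Re-read the report and "
                "return the complete list, including every mention."},
]
second = client.chat.completions.create(model="gpt-4o-mini", messages=messages)
print(second.choices[0].message.content)
```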