

From text to data: Open-source large language models in extracting cancer related medical attributes from German pathology reports.

Authors

Bartels Stefan, Carus Jasmin

Affiliation

University Medical Center Hamburg-Eppendorf / University Cancer Center, Martinistr. 52, Hamburg, 22767, Germany.

Publication

Int J Med Inform. 2025 Nov;203:106022. doi: 10.1016/j.ijmedinf.2025.106022. Epub 2025 Jul 2.

Abstract

Structured oncological documentation is vital for data-driven cancer care, yet extracting clinical features from unstructured pathology reports remains challenging, especially in German healthcare, where strict data protection rules require local model deployment. This study evaluates open-source large language models (LLMs) for extracting oncological attributes from German pathology reports in a secure, on-premise setting. We created a gold-standard dataset of 522 annotated reports and developed a retrieval-augmented generation (RAG) pipeline using an additional 15,000 pathology reports. Five instruction-tuned LLMs (Llama 3.3 70B, Mistral Small 24B, and three SauerkrautLM variants) were evaluated using three prompting strategies: zero-shot, few-shot, and RAG-enhanced few-shot prompting. All models produced structured JSON outputs and were assessed using entity-level precision, recall, accuracy, and macro-averaged F1-score. Results show that Llama 3.3 70B achieved the highest overall performance (F1 > 0.90). However, when combined with the RAG pipeline, Mistral Small 24B achieved nearly equivalent performance, matching Llama 70B on most entity types while requiring significantly fewer computational resources. Prompting strategy significantly impacted performance: few-shot prompting improved baseline accuracy, and RAG further enhanced performance, particularly for models with fewer than 24B parameters. Challenges remained in extracting less frequent but clinically critical attributes like metastasis and staging, underscoring the importance of retrieval mechanisms and balanced training data. This study demonstrates that open-source LLMs, when paired with effective prompting and retrieval strategies, can enable high-quality, privacy-compliant extraction of oncological information from unstructured text. The finding that smaller models can match larger ones through retrieval augmentation highlights a path toward scalable, resource-efficient deployment in German clinical settings.
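The RAG-enhanced few-shot strategy the abstract describes can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the token-overlap retriever stands in for whatever retriever indexes the 15,000 annotated reports, and the prompt layout and German snippets are assumptions.

```python
from collections import Counter


def retrieve_examples(query: str, corpus: list[dict], k: int = 2) -> list[dict]:
    """Rank annotated report/JSON pairs by token-overlap similarity.

    A simple stand-in for the embedding-based retriever of a RAG
    pipeline: the k most similar annotated reports become few-shot
    examples for the target report.
    """
    q = Counter(query.lower().split())

    def score(doc: dict) -> int:
        d = Counter(doc["text"].lower().split())
        return sum((q & d).values())  # shared-token count

    return sorted(corpus, key=score, reverse=True)[:k]


def build_prompt(report: str, examples: list[dict]) -> str:
    """Assemble a RAG-enhanced few-shot prompt: an instruction,
    retrieved report/JSON pairs, then the target report."""
    parts = ["Extract the oncological attributes as JSON."]
    for ex in examples:
        parts.append(f"Report: {ex['text']}\nJSON: {ex['json']}")
    parts.append(f"Report: {report}\nJSON:")
    return "\n\n".join(parts)
```

The prompt ends at `JSON:` so the model's continuation is the structured output, which can then be parsed and validated against the expected schema.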
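The entity-level evaluation can be illustrated with a minimal sketch. The exact-match criterion and the entity names used below (e.g. "T", "N") are assumptions for illustration, not the paper's published protocol.

```python
from collections import defaultdict


def entity_scores(gold: list[dict], pred: list[dict]):
    """Compute per-entity precision, recall, F1 and the macro-averaged F1.

    gold, pred: one dict per report, mapping entity type to extracted
    value. A prediction counts as a true positive only if it exactly
    matches the gold annotation.
    """
    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    for g, p in zip(gold, pred):
        for ent, value in g.items():
            if p.get(ent) == value:
                tp[ent] += 1
            else:
                fn[ent] += 1            # gold value missed
                if p.get(ent) is not None:
                    fp[ent] += 1        # wrong value predicted
        for ent in p:
            if ent not in g and p[ent] is not None:
                fp[ent] += 1            # spurious entity
    scores = {}
    for ent in set(tp) | set(fp) | set(fn):
        prec = tp[ent] / (tp[ent] + fp[ent]) if tp[ent] + fp[ent] else 0.0
        rec = tp[ent] / (tp[ent] + fn[ent]) if tp[ent] + fn[ent] else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        scores[ent] = {"precision": prec, "recall": rec, "f1": f1}
    macro_f1 = sum(s["f1"] for s in scores.values()) / len(scores)
    return scores, macro_f1
```

Macro-averaging weights every entity type equally, which is why rare but clinically critical attributes such as metastasis and staging can pull the overall F1 down even when frequent attributes are extracted well.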

