在德国，开源大语言模型可用于肿瘤记录吗？——对泌尿科医生笔记的评估

Can open source large language models be used for tumor documentation in Germany?-An evaluation on urological doctors' notes.

作者信息

Lenz Stefan, Ustjanzew Arsenij, Jeray Marco, Ressing Meike, Panholzer Torsten

机构信息

Institute of Medical Biostatistics, Epidemiology and Informatics, University Medical Centre of the Johannes Gutenberg-University Mainz, Mainz, Germany.

Privacy, Compliance, and Risk Management Office, University Medical Centre of the Johannes Gutenberg-University Mainz, Mainz, Germany.

出版信息

BioData Min. 2025 Jul 24;18(1):48. doi: 10.1186/s13040-025-00463-8.

DOI:10.1186/s13040-025-00463-8

PMID:40707949

原文链接:https://pmc.ncbi.nlm.nih.gov/articles/PMC12291363/

Abstract

BACKGROUND

Tumor documentation in Germany is currently a largely manual process. It involves reading the textual patient documentation and filling in forms in dedicated databases to obtain structured data. Advances in information extraction techniques that build on large language models (LLMs) could have the potential for enhancing the efficiency and reliability of this process. Evaluating LLMs in the German medical domain, especially their ability to interpret specialized language, is essential to determine their suitability for the use in clinical documentation. Due to data protection regulations, only locally deployed open source LLMs are generally suitable for this application.

METHODS

The evaluation employs eleven different open source LLMs with sizes ranging from 1 to 70 billion model parameters. Three basic tasks were selected as representative examples for the tumor documentation process: identifying tumor diagnoses, assigning ICD-10 codes, and extracting the date of first diagnosis. For evaluating the LLMs on these tasks, a dataset of annotated text snippets based on anonymized doctors' notes from urology was prepared. Different prompting strategies were used to investigate the effect of the number of examples in few-shot prompting and to explore the capabilities of the LLMs in general.

RESULTS

The models Llama 3.1 8B, Mistral 7B, and Mistral NeMo 12 B performed comparably well in the tasks. Models with less extensive training data or having fewer than 7 billion parameters showed notably lower performance, while larger models did not display performance gains. Examples from a different medical domain than urology could also improve the outcome in few-shot prompting, which demonstrates the ability of LLMs to handle tasks needed for tumor documentation.

CONCLUSIONS

Open source LLMs show a strong potential for automating tumor documentation. Models from 7-12 billion parameters could offer an optimal balance between performance and resource efficiency. With tailored fine-tuning and well-designed prompting, these models might become important tools for clinical documentation in the future. The code for the evaluation is available from https://github.com/stefan-m-lenz/UroLlmEval . We also release the data set under https://huggingface.co/datasets/stefan-m-lenz/UroLlmEvalSet providing a valuable resource that addresses the shortage of authentic and easily accessible benchmarks in German-language medical NLP.

摘要

背景

在德国，肿瘤文档记录目前在很大程度上是一个手动过程。它包括阅读患者的文本病历并在专用数据库中填写表格以获取结构化数据。基于大语言模型（LLMs）的信息提取技术的进步可能有潜力提高这个过程的效率和可靠性。评估德语医学领域的大语言模型，特别是它们解释专业语言的能力，对于确定它们在临床文档记录中的适用性至关重要。由于数据保护法规，通常只有本地部署的开源大语言模型适用于此应用。

方法

评估采用了11种不同的开源大语言模型，模型参数大小从10亿到700亿不等。选择了三个基本任务作为肿瘤文档记录过程的代表性示例：识别肿瘤诊断、分配ICD - 10编码以及提取首次诊断日期。为了在这些任务上评估大语言模型，基于泌尿科匿名医生笔记准备了一个带注释文本片段的数据集。使用了不同的提示策略来研究少样本提示中示例数量的影响，并总体探索大语言模型的能力。

结果

Llama 3.1 8B、Mistral 7B和Mistral NeMo 12 B模型在任务中表现相当出色。训练数据较少或参数少于70亿的模型表现明显较低，而更大的模型并未显示出性能提升。来自泌尿科以外不同医学领域的示例也可以改善少样本提示的结果，这证明了大语言模型处理肿瘤文档记录所需任务的能力。

结论

开源大语言模型在自动化肿瘤文档记录方面显示出强大的潜力。参数在70亿到120亿之间的模型可能在性能和资源效率之间提供最佳平衡。通过定制的微调以及精心设计的提示，这些模型未来可能会成为临床文档记录的重要工具。评估代码可从https://github.com/stefan - m - lenz/UroLlmEval获取。我们还在https://huggingface.co/datasets/stefan - m - lenz/UroLlmEvalSet下发布了数据集，提供了一个有价值的资源，解决了德语医学自然语言处理中真实且易于获取的基准测试短缺的问题。