Lehmann Sebastian, Wilhelmy Florian, von Dercks Nikolaus, Güresir Erdem, Wach Johannes
Department of Neurosurgery, University Hospital Leipzig, 04103, Leipzig, Germany.
Medical Management, University Hospital Leipzig, 04103, Leipzig, Germany.
Acta Neurochir (Wien). 2025 Jul 31;167(1):209. doi: 10.1007/s00701-025-06631-3.
In the German medical billing system, surgical departments encode their procedures as OPS codes. These OPS codes have a major impact on DRG grouping and thus largely determine each case's revenue. In this study, we investigate the ability of the large language model (LLM) GPT to derive correct OPS codes from the surgical report.
We examined 100 patients who underwent meningioma surgery at our clinic between 2023 and 2024. We recorded the OPS codes assigned by the surgeon after the procedure, as well as the final coding by the hospital's coders before case closure. In addition, the surgical report was extracted and provided in anonymized form to GPT-4o and GPT CodeMedic, together with the current OPS catalogue. The coding of each group was analyzed descriptively and compared using the chi-square test. Errors and deviations were also assessed and analyzed.
In our analyses, coders (100%) and surgeons (99%) performed significantly better than the LLMs in sufficient coding, for which the basic coding must be correct and unquestionable (GPT-4o 78%, GPT CodeMedic 89%; p < 0.01). For optimal coding, where every code that could increase revenue must be included, only the coders (94%) were superior to both LLMs (GPT-4o p < 0.01; GPT CodeMedic p = 0.02), whereas GPT CodeMedic (83%) even outperformed the surgeons (69%) (p = 0.03). The specialized GPT CodeMedic tended to show fewer hallucinations than GPT-4o (7% vs. 15%).
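As an illustration of the kind of comparison described above, the following minimal sketch runs a chi-square test on a 2x2 table for optimal coding by GPT CodeMedic versus surgeons. It assumes that each group coded all 100 cases, so the reported 83% and 69% become counts of 83 and 69. The function name `chi_square_2x2` and the optional Yates continuity correction are illustrative choices. The abstract does not state which variant of the test the authors used, so the resulting p-values need not match the reported ones exactly.

```python
import math


def chi_square_2x2(a, b, c, d, yates=False):
    """Pearson chi-square test for the 2x2 table [[a, b], [c, d]] (1 degree of freedom).

    Returns (chi2, p). With yates=True, the Yates continuity correction is applied.
    """
    n = a + b + c + d
    diff = abs(a * d - b * c)
    if yates:
        # Continuity correction: shrink |ad - bc| by n/2, but not below zero.
        diff = max(0.0, diff - n / 2)
    chi2 = n * diff ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # For 1 degree of freedom, the chi-square survival function is erfc(sqrt(x / 2)).
    p = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p


# Assumed counts out of 100 cases per group (derived from the reported percentages):
# rows = GPT CodeMedic, surgeons; columns = optimal coding achieved, not achieved.
chi2, p = chi_square_2x2(83, 17, 69, 31)
chi2_y, p_y = chi_square_2x2(83, 17, 69, 31, yates=True)

print(f"uncorrected: chi2 = {chi2:.3f}, p = {p:.3f}")
print(f"with Yates:  chi2 = {chi2_y:.3f}, p = {p_y:.3f}")
```

The same function can be applied to any other pair of groups by substituting the corresponding counts, again under the assumption of 100 coded cases per group.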
GPT is capable of extracting OPS codes from surgical reports. The most frequent errors made by the LLMs can be attributed to a lack of specialized training. Currently, professional coders still significantly outperform LLMs in both sufficient and optimal coding. For optimal coding, however, GPT performs comparably to surgeons. This indicates that, in the near future and after further training, LLMs may take over this task from surgeons without loss of quality.