

Evaluating the Reasoning Capabilities of Large Language Models for Medical Coding and Hospital Readmission Risk Stratification: Zero-Shot Prompting Approach.

Authors

Naliyatthaliyazchayil Parvati, Muthyala Raajitha, Gichoya Judy Wawira, Purkayastha Saptarshi

Affiliations

Department of Biomedical Engineering and Informatics, Luddy School of Informatics, Computing and Engineering, Indiana University Indianapolis, 535 W Michigan Street, Indianapolis, IN, 46202, United States, 1 317 274 0439.

Department of Radiology and Imaging Sciences, Emory University School of Medicine, Emory University, Atlanta, GA, United States.

Publication

J Med Internet Res. 2025 Jul 30;27:e74142. doi: 10.2196/74142.

DOI:10.2196/74142
PMID:40737604
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12310144/
Abstract

BACKGROUND

Large language models (LLMs) such as ChatGPT-4, LLaMA-3.1, Gemini-1.5, DeepSeek-R1, and OpenAI-O3 have shown promising potential in health care, particularly for clinical reasoning and decision support. However, their reliability across critical tasks like diagnosis, medical coding, and risk prediction has received mixed reviews, especially in real-world settings without task-specific training.

OBJECTIVE

This study aims to evaluate and compare the zero-shot performance of reasoning and nonreasoning LLMs in three essential clinical tasks: (1) primary diagnosis generation, (2) ICD-9 (International Classification of Diseases, Ninth Revision) medical code prediction, and (3) hospital readmission risk stratification. The goal is to assess whether these models can serve as general-purpose clinical decision support tools and to identify gaps in current capabilities.

METHODS

Using the Medical Information Mart for Intensive Care-IV dataset, we selected a random cohort of 300 hospital discharge summaries. Prompts were engineered to include structured clinical content from 5 note sections: chief complaints, past medical history, surgical history, laboratories, and imaging. Prompts were standardized and zero-shot, with no model fine-tuning or repetition across runs. All model interactions were conducted through publicly available web user interfaces, without using application programming interfaces, to simulate real-world accessibility for nontechnical users. We incorporated rationale elicitation into prompts to evaluate model transparency, especially in reasoning models. Ground-truth labels were derived from the primary diagnosis documented in clinical notes, structured ICD-9 codes from diagnosis, and hospital-recorded readmission frequencies for risk stratification. Performance was measured using F1-scores and correctness percentages, and comparative performance was analyzed statistically.
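The evaluation pipeline described above (standardized zero-shot prompts built from 5 note sections, scored with F1 against ground-truth codes) can be sketched in a few lines. This is an illustrative reconstruction, not the authors' implementation: the section names follow the paper, but the prompt wording, `build_prompt`, and the set-based `f1_codes` helper are assumptions.

```python
# Hypothetical sketch of the zero-shot evaluation described in the Methods.
# Prompt wording and the scoring helper are illustrative assumptions.

SECTIONS = ["chief complaints", "past medical history",
            "surgical history", "laboratories", "imaging"]

def build_prompt(note: dict) -> str:
    """Assemble a standardized zero-shot prompt from the 5 note sections,
    asking for a diagnosis, ICD-9 codes, risk level, and a rationale."""
    body = "\n".join(f"{s.title()}: {note.get(s, 'not documented')}"
                     for s in SECTIONS)
    return ("You are given a hospital discharge summary.\n"
            f"{body}\n"
            "Task: state the primary diagnosis, predict ICD-9 codes, "
            "stratify readmission risk (low/medium/high), "
            "and explain your reasoning.")

def f1_codes(predicted: set, truth: set) -> float:
    """Set-based F1 between predicted and ground-truth ICD-9 codes."""
    if not predicted or not truth:
        return 0.0
    tp = len(predicted & truth)
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(truth)
    return 2 * precision * recall / (precision + recall)

# Example with made-up codes: one of two predictions is correct,
# one of two ground-truth codes is recovered, so F1 = 0.5.
prompt = build_prompt({"chief complaints": "chest pain",
                       "laboratories": "troponin elevated"})
score = f1_codes({"410.71", "401.9"}, {"410.71", "428.0"})
print(round(score, 2))  # → 0.5
```

Because the prompts were submitted through web user interfaces rather than an API, a loop like this would score pasted-back model outputs rather than drive the models programmatically.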

RESULTS

Among nonreasoning models, LLaMA-3.1 achieved the highest primary diagnosis accuracy (n=255, 85%), followed by ChatGPT-4 (n=254, 84.7%) and Gemini-1.5 (n=237, 79%). For ICD-9 prediction, correctness dropped significantly across all models: LLaMA-3.1 (n=128, 42.6%), ChatGPT-4 (n=122, 40.6%), and Gemini-1.5 (n=44, 14.6%). Hospital readmission risk prediction showed low performance in nonreasoning models: LLaMA-3.1 (n=124, 41.3%), Gemini-1.5 (n=122, 40.7%), and ChatGPT-4 (n=99, 33%). Among reasoning models, OpenAI-O3 outperformed in diagnosis (n=270, 90%) and ICD-9 coding (n=136, 45.3%), while DeepSeek-R1 performed slightly better in the readmission risk prediction (n=218, 72.6% vs O3's n=212, 70.6%). Despite improved explainability, reasoning models generated verbose responses. None of the models met clinical standards across all tasks, and performance in medical coding remained the weakest area across all models.

CONCLUSIONS

Current LLMs exhibit moderate success in zero-shot diagnosis and risk prediction but underperform in ICD-9 code generation, reinforcing findings from prior studies. Reasoning models offer marginally better performance and increased interpretability, but their reliability remains limited. Overall, statistical analysis revealed that OpenAI-O3 outperformed the other models. These results highlight the need for task-specific fine-tuning and human-in-the-loop review. Future work will explore fine-tuning, stability through repeated trials, and evaluation on a different subset of deidentified real-world data with a larger sample size.

