

Evaluating the Reasoning Capabilities of Large Language Models for Medical Coding and Hospital Readmission Risk Stratification: Zero-Shot Prompting Approach.

Authors

Naliyatthaliyazchayil Parvati, Muthyala Raajitha, Gichoya Judy Wawira, Purkayastha Saptarshi

Affiliations

Department of Biomedical Engineering and Informatics, Luddy School of Informatics, Computing and Engineering, Indiana University Indianapolis, 535 W Michigan Street, Indianapolis, IN, 46202, United States, 1 317 274 0439.

Department of Radiology and Imaging Sciences, Emory University School of Medicine, Emory University, Atlanta, GA, United States.

Publication

J Med Internet Res. 2025 Jul 30;27:e74142. doi: 10.2196/74142.

DOI:10.2196/74142
PMID:40737604
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12310144/
Abstract

BACKGROUND

Large language models (LLMs) such as ChatGPT-4, LLaMA-3.1, Gemini-1.5, DeepSeek-R1, and OpenAI-O3 have shown promising potential in health care, particularly for clinical reasoning and decision support. However, their reliability across critical tasks like diagnosis, medical coding, and risk prediction has received mixed reviews, especially in real-world settings without task-specific training.

OBJECTIVE

This study aims to evaluate and compare the zero-shot performance of reasoning and nonreasoning LLMs in three essential clinical tasks: (1) primary diagnosis generation, (2) ICD-9 (International Classification of Diseases, Ninth Revision) medical code prediction, and (3) hospital readmission risk stratification. The goal is to assess whether these models can serve as general-purpose clinical decision support tools and to identify gaps in current capabilities.

METHODS

Using the Medical Information Mart for Intensive Care-IV dataset, we selected a random cohort of 300 hospital discharge summaries. Prompts were engineered to include structured clinical content from 5 note sections: chief complaints, past medical history, surgical history, laboratories, and imaging. Prompts were standardized and zero-shot, with no model fine-tuning or repetition across runs. All model interactions were conducted through publicly available web user interfaces, without using application programming interfaces, to simulate real-world accessibility for nontechnical users. We incorporated rationale elicitation into prompts to evaluate model transparency, especially in reasoning models. Ground-truth labels were derived from the primary diagnosis documented in clinical notes, structured ICD-9 codes from diagnosis, and hospital-recorded readmission frequencies for risk stratification. Performance was measured using F1-scores and correctness percentages, and comparative performance was analyzed statistically.
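The evaluation pipeline described above (standardized zero-shot prompts built from 5 note sections, scored with F1 against ground-truth codes) can be sketched in a few lines. This is an illustrative reconstruction, not the authors' implementation: the section names follow the paper, but the prompt wording, `build_prompt`, and the set-based `f1_codes` helper are assumptions.

```python
# Hypothetical sketch of the zero-shot evaluation described in the Methods.
# Prompt wording and the scoring helper are illustrative assumptions.

SECTIONS = ["chief complaints", "past medical history",
            "surgical history", "laboratories", "imaging"]

def build_prompt(note: dict) -> str:
    """Assemble a standardized zero-shot prompt from the 5 note sections,
    asking for a diagnosis, ICD-9 codes, risk level, and a rationale."""
    body = "\n".join(f"{s.title()}: {note.get(s, 'not documented')}"
                     for s in SECTIONS)
    return ("You are given a hospital discharge summary.\n"
            f"{body}\n"
            "Task: state the primary diagnosis, predict ICD-9 codes, "
            "stratify readmission risk (low/medium/high), "
            "and explain your reasoning.")

def f1_codes(predicted: set, truth: set) -> float:
    """Set-based F1 between predicted and ground-truth ICD-9 codes."""
    if not predicted or not truth:
        return 0.0
    tp = len(predicted & truth)
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(truth)
    return 2 * precision * recall / (precision + recall)

# Example with made-up codes: one of two predictions is correct,
# one of two ground-truth codes is recovered, so F1 = 0.5.
prompt = build_prompt({"chief complaints": "chest pain",
                       "laboratories": "troponin elevated"})
score = f1_codes({"410.71", "401.9"}, {"410.71", "428.0"})
print(round(score, 2))  # → 0.5
```

Because the prompts were submitted through web user interfaces rather than an API, a loop like this would score pasted-back model outputs rather than drive the models programmatically.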

RESULTS

Among nonreasoning models, LLaMA-3.1 achieved the highest primary diagnosis accuracy (n=255, 85%), followed by ChatGPT-4 (n=254, 84.7%) and Gemini-1.5 (n=237, 79%). For ICD-9 prediction, correctness dropped significantly across all models: LLaMA-3.1 (n=128, 42.6%), ChatGPT-4 (n=122, 40.6%), and Gemini-1.5 (n=44, 14.6%). Hospital readmission risk prediction showed low performance in nonreasoning models: LLaMA-3.1 (n=124, 41.3%), Gemini-1.5 (n=122, 40.7%), and ChatGPT-4 (n=99, 33%). Among reasoning models, OpenAI-O3 outperformed in diagnosis (n=270, 90%) and ICD-9 coding (n=136, 45.3%), while DeepSeek-R1 performed slightly better in the readmission risk prediction (n=218, 72.6% vs O3's n=212, 70.6%). Despite improved explainability, reasoning models generated verbose responses. None of the models met clinical standards across all tasks, and performance in medical coding remained the weakest area across all models.

CONCLUSIONS

Current LLMs exhibit moderate success in zero-shot diagnosis and risk prediction but underperform in ICD-9 code generation, reinforcing findings from prior studies. Reasoning models offer marginally better performance and increased interpretability, but their reliability remains limited. Overall, statistical analysis revealed that OpenAI-O3 outperformed the other models. These results highlight the need for task-specific fine-tuning and human-in-the-loop review. Future work will explore fine-tuning, stability through repeated trials, and evaluation on a different subset of deidentified real-world data with a larger sample size.

