BACKGROUND

ChatGPT (OpenAI) has shown great potential in clinical diagnosis and could become an excellent auxiliary tool in clinical practice. This study evaluates ChatGPT's diagnostic capabilities by comparing the performance of two model iterations, GPT-3.5 and GPT-4.0.

OBJECTIVE

This study aims to evaluate the diagnostic ability of GPT-3.5 and GPT-4.0 for colon cancer, assess their potential as auxiliary diagnostic tools for surgeons, and compare the diagnostic accuracy rates of the two models. We assess the accuracy of primary and secondary diagnoses and analyze the causes of misdiagnoses in GPT-3.5 and GPT-4.0 according to 7 categories: patient histories, symptoms, physical signs, laboratory examinations, imaging examinations, pathological examinations, and intraoperative findings.

METHODS

We retrieved 316 case reports of intestinal cancer from the Chinese Medical Association Publishing House database, of which 286 were deemed valid after data cleansing. The cases were translated from Chinese to English and then input into GPT-3.5 and GPT-4.0 using a simple, direct prompt to elicit primary and secondary diagnoses. Three senior surgeons specializing in colorectal surgery, from the General Surgery Department of the Chinese PLA (People's Liberation Army) General Hospital, assessed the diagnostic information and scored the accuracy of primary and secondary diagnoses against predefined criteria. We then compared the diagnostic accuracy of GPT-4.0 and GPT-3.5 and analyzed the causes of misdiagnoses in both models according to 7 categories: patient histories, symptoms, physical signs, laboratory examinations, imaging examinations, pathological examinations, and intraoperative findings.

RESULTS

Across the 286 cases, GPT-4.0 and GPT-3.5 both demonstrated high diagnostic accuracy for primary diagnoses, but the accuracy of GPT-4.0 was significantly higher than that of GPT-3.5 (mean 0.972, SD 0.137 vs mean 0.855, SD 0.335; t=5.753; P<.001). For secondary diagnoses, the accuracy of GPT-4.0 was also significantly higher than that of GPT-3.5 (mean 0.908, SD 0.159 vs mean 0.617, SD 0.349; t=-7.727; P<.001). GPT-3.5 showed limitations in processing patient history, symptom presentation, laboratory tests, and imaging data. Although GPT-4.0 improved upon GPT-3.5, it still showed limitations in identifying symptoms and laboratory test data. For both primary and secondary diagnoses, the difference in accuracy between GPT-4.0 and GPT-3.5 did not vary significantly with age, gender, or system group.
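The abstract does not state which t test produced these statistics. As a rough illustration only, the sketch below computes an independent-samples (Welch) t statistic from the reported primary-diagnosis summary values, assuming n=286 for both models. The function name `welch_t_from_summary` is hypothetical. Because both models were scored on the same cases, a paired test on per-case scores would also be possible, and it would generally give a different value, so the result here need not match the reported t=5.753.

```python
import math


def welch_t_from_summary(mean1, sd1, n1, mean2, sd2, n2):
    """Welch t statistic from two groups' means, SDs, and sizes.

    Assumption: the two groups are treated as independent samples,
    which the abstract does not confirm.
    """
    # Standard error of the difference between the two means
    se = math.sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)
    return (mean1 - mean2) / se


# Reported primary-diagnosis summary values (GPT-4.0 vs GPT-3.5);
# n=286 per model is assumed from the number of valid cases.
t_primary = welch_t_from_summary(0.972, 0.137, 286, 0.855, 0.335, 286)
print(f"{t_primary:.2f}")  # prints 5.47
```

The gap between this value and the reported statistic is what one would expect if the original analysis used a different test, such as a paired comparison; the per-case scores would be needed to check that.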

CONCLUSIONS

This study demonstrates that ChatGPT, particularly GPT-4.0, possesses significant diagnostic potential, with GPT-4.0 exhibiting higher accuracy than GPT-3.5. However, GPT-4.0 still has limitations, particularly in recognizing patient symptoms and laboratory data, indicating a need for more research in real-world clinical settings to enhance its diagnostic capabilities.

The Diagnostic Ability of GPT-3.5 and GPT-4.0 in Surgery: Comparative Analysis.
