Department of Neurosurgery, The First Medical Centre, Chinese PLA General Hospital, Beijing, China.
Department of Respiratory and Critical Care Medicine, The First Medical Centre, Chinese PLA General Hospital, Beijing, China.
J Med Internet Res. 2024 Sep 10;26:e54985. doi: 10.2196/54985.
ChatGPT (OpenAI) has shown great potential in clinical diagnosis and could become an excellent auxiliary tool in clinical practice. This study evaluates ChatGPT's diagnostic capabilities by comparing the performance of two model iterations, GPT-3.5 and GPT-4.0.
This study aims to evaluate the diagnostic ability of GPT-3.5 and GPT-4.0 for colon cancer, assess their potential as auxiliary diagnostic tools for surgeons, and compare the diagnostic accuracy rates of GPT-3.5 and GPT-4.0. We assess the accuracy of primary and secondary diagnoses and analyze the causes of misdiagnoses in GPT-3.5 and GPT-4.0 according to 7 categories: patient histories, symptoms, physical signs, laboratory examinations, imaging examinations, pathological examinations, and intraoperative findings.
We retrieved 316 case reports for intestinal cancer from the Chinese Medical Association Publishing House database, of which 286 cases were deemed valid after data cleansing. The cases were translated from Mandarin to English and then input into GPT-3.5 and GPT-4.0 using a simple, direct prompt to elicit primary and secondary diagnoses. We conducted a comparative study to evaluate the diagnostic accuracy of GPT-4.0 and GPT-3.5. Three senior surgeons from the General Surgery Department of the Chinese PLA (People's Liberation Army) General Hospital, all specializing in colorectal surgery, assessed the diagnostic information. The accuracy of primary and secondary diagnoses was scored based on predefined criteria. Additionally, we analyzed and compared the causes of misdiagnoses in both models according to 7 categories: patient histories, symptoms, physical signs, laboratory examinations, imaging examinations, pathological examinations, and intraoperative findings.
Across the 286 cases, GPT-4.0 and GPT-3.5 both demonstrated high diagnostic accuracy for primary diagnoses, but the accuracy of GPT-4.0 was significantly higher than that of GPT-3.5 (mean 0.972, SD 0.137 vs mean 0.855, SD 0.335; t=5.753; P<.001). For secondary diagnoses, the accuracy of GPT-4.0 was also significantly higher than that of GPT-3.5 (mean 0.908, SD 0.159 vs mean 0.617, SD 0.349; t=-7.727; P<.001). GPT-3.5 showed limitations in processing patient history, symptom presentation, laboratory tests, and imaging data. Although GPT-4.0 improved upon GPT-3.5, it still showed limitations in identifying symptoms and laboratory test data. For both primary and secondary diagnoses, there was no significant difference in accuracy related to age, gender, or system group between GPT-4.0 and GPT-3.5.
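To make the reported t statistics easier to interpret, the following is a minimal sketch of how a t value for per-case accuracy scores could be computed. It assumes a paired design (both models scored on the same cases); the abstract does not state which t-test variant the authors used. The function name `paired_t_statistic` and the toy score lists are hypothetical illustrations, not the study's code or data.

```python
import math
import statistics


def paired_t_statistic(scores_a, scores_b):
    """Return (t, degrees_of_freedom) for a paired t-test.

    Assumption: scores_a[i] and scores_b[i] are accuracy scores that two
    models received on the same case i.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("both score lists must cover the same cases")
    # Per-case difference in score between the two models.
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    if n < 2:
        raise ValueError("at least two paired cases are required")
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample SD (n - 1 denominator)
    if sd_diff == 0:
        raise ValueError("t is undefined when all differences are identical")
    # t = mean difference divided by its standard error.
    t = mean_diff / (sd_diff / math.sqrt(n))
    return t, n - 1


# Hypothetical per-case scores for illustration only (not study data).
model_a_scores = [1.0, 1.0, 0.5, 1.0, 1.0]
model_b_scores = [1.0, 0.5, 0.5, 0.0, 1.0]

t_stat, dof = paired_t_statistic(model_a_scores, model_b_scores)
print(f"t = {t_stat:.3f}, df = {dof}")
```

In this sketch the sign of t depends only on the order of subtraction, so swapping the two arguments flips the sign without changing the magnitude or the P value of a two-sided test.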
This study demonstrates that ChatGPT, particularly GPT-4.0, possesses significant diagnostic potential, with GPT-4.0 exhibiting higher accuracy than GPT-3.5. However, GPT-4.0 still has limitations, particularly in recognizing patient symptoms and laboratory data, indicating a need for more research in real-world clinical settings to enhance its diagnostic capabilities.