ChatGPT provides inconsistent risk-stratification of patients with atraumatic chest pain.

Affiliations

Department of Family Medicine, University of Washington School of Medicine, Seattle, Washington, United States of America.

Department of Medical Education and Clinical Sciences, Washington State University, Spokane, Washington, United States of America.

Publication information

PLoS One. 2024 Apr 16;19(4):e0301854. doi: 10.1371/journal.pone.0301854. eCollection 2024.

Abstract

BACKGROUND

ChatGPT-4 is a large language model with promising healthcare applications. However, its ability to analyze complex clinical data and produce consistent results is poorly characterized. This study evaluated ChatGPT-4's risk stratification of simulated patients with acute nontraumatic chest pain against validated tools.

METHODS

Three datasets of simulated case studies were created: one based on the TIMI score variables, another on the HEART score variables, and a third comprising 44 randomized variables related to nontraumatic chest pain presentations. ChatGPT-4 independently scored each dataset five times. Its risk scores were compared to the calculated TIMI and HEART scores, and the consistency of its scoring across the five runs of the 44-variable dataset was evaluated.
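For concreteness, one of the validated comparators, the HEART score, is computed deterministically from five components (History, ECG, Age, Risk factors, Troponin), which is what makes repeated scoring of identical cases a fair consistency test. A minimal sketch, assuming the published HEART point thresholds; function names and the input encoding are illustrative, not the study's code:

```python
def heart_score(history, ecg, age, risk_factors, troponin_ratio):
    """Compute the HEART score (0-10).

    history, ecg: clinician-rated component points (0, 1, or 2).
    age: patient age in years.
    risk_factors: count of cardiovascular risk factors (known
        atherosclerotic disease can be encoded as a count >= 3).
    troponin_ratio: measured troponin / upper limit of normal.
    """
    age_pts = 0 if age < 45 else (1 if age < 65 else 2)
    rf_pts = 0 if risk_factors == 0 else (1 if risk_factors <= 2 else 2)
    trop_pts = 0 if troponin_ratio <= 1 else (1 if troponin_ratio <= 3 else 2)
    return history + ecg + age_pts + rf_pts + trop_pts


def heart_risk_band(score):
    """Map a HEART score to its conventional risk band."""
    if score <= 3:
        return "low"       # 0-3: low risk
    if score <= 6:
        return "moderate"  # 4-6: moderate risk
    return "high"          # 7-10: high risk
```

Because the mapping from inputs to band is a fixed rule, any two runs on the same case must agree; the study's question is whether ChatGPT-4 shows the same property.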

RESULTS

ChatGPT-4's scores correlated strongly with TIMI and HEART scores (r = 0.898 and 0.928, respectively), but its individual risk assessments were broadly distributed: for a fixed TIMI or HEART score, ChatGPT-4 assigned a different risk level 45-48% of the time. On the 44-variable dataset, a majority of the five ChatGPT-4 runs agreed on a diagnosis category only 56% of the time, and risk scores between runs were poorly correlated (r = 0.605).
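The two consistency measures reported above — Pearson correlation between score sets and majority agreement on a diagnosis category across repeated runs — can be sketched as follows. This is a minimal illustration of the metrics, not the study's analysis code:

```python
import statistics


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def majority_agreement_rate(runs):
    """Fraction of cases on which a strict majority of repeated runs
    assigned the same category.

    runs: list of per-run category lists, one category per case,
    e.g. five runs over the same simulated cases.
    """
    n_runs = len(runs)
    agree = 0
    for case in zip(*runs):          # one tuple of categories per case
        top = max(case.count(c) for c in set(case))
        agree += top > n_runs / 2    # strict majority required
    return agree / len(runs[0])
```

With five runs, `majority_agreement_rate` counts a case as consistent when at least three runs pick the same category, matching the 56% agreement figure's framing.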

CONCLUSION

While ChatGPT-4 correlates closely with established risk stratification tools regarding mean scores, its inconsistency when presented with identical patient data on separate occasions raises concerns about its reliability. The findings suggest that while large language models like ChatGPT-4 hold promise for healthcare applications, further refinement and customization are necessary, particularly in the clinical risk assessment of atraumatic chest pain patients.


Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/d795/11020975/bcaa915b1ada/pone.0301854.g001.jpg
