
ChatGPT's Response Consistency: A Study on Repeated Queries of Medical Examination Questions.

Authors

Funk Paul F, Hoch Cosima C, Knoedler Samuel, Knoedler Leonard, Cotofana Sebastian, Sofo Giuseppe, Bashiri Dezfouli Ali, Wollenberg Barbara, Guntinas-Lichius Orlando, Alfertshofer Michael

Affiliations

Department of Otorhinolaryngology, Head and Neck Surgery, University Hospital Jena, Friedrich Schiller University Jena, Am Klinikum 1, 07747 Jena, Germany.

Department of Otolaryngology, Head and Neck Surgery, School of Medicine and Health, Technical University of Munich (TUM), Ismaningerstrasse 22, 81675 Munich, Germany.

Publication

Eur J Investig Health Psychol Educ. 2024 Mar 8;14(3):657-668. doi: 10.3390/ejihpe14030043.

DOI:10.3390/ejihpe14030043
PMID:38534904
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC10969490/
Abstract

(1) Background: As the field of artificial intelligence (AI) evolves, tools like ChatGPT are increasingly integrated into various domains of medicine, including medical education and research. Given the critical nature of medicine, it is of paramount importance that AI tools offer a high degree of reliability in the information they provide. (2) Methods: A total of n = 450 medical examination questions were manually entered into ChatGPT thrice, each for ChatGPT 3.5 and ChatGPT 4. The responses were collected, and their accuracy and consistency were statistically analyzed throughout the series of entries. (3) Results: ChatGPT 4 displayed a statistically significantly improved accuracy with 85.7% compared to that of 57.7% of ChatGPT 3.5 (p < 0.001). Furthermore, ChatGPT 4 was more consistent, correctly answering 77.8% across all rounds, a significant increase from the 44.9% observed from ChatGPT 3.5 (p < 0.001). (4) Conclusions: The findings underscore the increased accuracy and dependability of ChatGPT 4 in the context of medical education and potential clinical decision making. Nonetheless, the research emphasizes the indispensable nature of human-delivered healthcare and the vital role of continuous assessment in leveraging AI in medicine.
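The two metrics reported in the abstract are easy to conflate: accuracy is the share of correct responses across all entries, while consistency is the share of questions answered correctly in every round. A minimal sketch of how these could be computed (not the authors' code; the data layout and function names are assumptions for illustration):

```python
# Hypothetical sketch of the two metrics described in the abstract.
# rounds[i][q] is True if question q was answered correctly in round i.
from typing import List

def accuracy(rounds: List[List[bool]]) -> float:
    """Overall share of correct responses across all rounds and questions."""
    total = sum(len(r) for r in rounds)
    correct = sum(sum(r) for r in rounds)
    return correct / total

def consistency(rounds: List[List[bool]]) -> float:
    """Share of questions answered correctly in *every* round."""
    n_questions = len(rounds[0])
    all_correct = sum(all(r[q] for r in rounds) for q in range(n_questions))
    return all_correct / n_questions

# Toy example: 4 questions, each entered 3 times.
rounds = [
    [True, True, False, True],
    [True, False, False, True],
    [True, True, True, True],
]
print(accuracy(rounds))     # 0.75 (9 of 12 responses correct)
print(consistency(rounds))  # 0.5 (questions 0 and 3 correct in all 3 rounds)
```

On this toy data, consistency is necessarily no higher than accuracy, which matches the paper's figures (77.8% vs. 85.7% for ChatGPT 4; 44.9% vs. 57.7% for ChatGPT 3.5).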


Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/3fb9/10969490/c8ce9fd539c9/ejihpe-14-00043-g001.jpg
Figure 2: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/3fb9/10969490/1724676a462f/ejihpe-14-00043-g002.jpg

Similar Articles

1
Performance of ChatGPT on the Chinese Postgraduate Examination for Clinical Medicine: Survey Study.
JMIR Med Educ. 2024 Feb 9;10:e48514. doi: 10.2196/48514.
2
Assessing question characteristic influences on ChatGPT's performance and response-explanation consistency: Insights from Taiwan's Nursing Licensing Exam.
Int J Nurs Stud. 2024 May;153:104717. doi: 10.1016/j.ijnurstu.2024.104717. Epub 2024 Feb 8.
3
ChatGPT's performance in German OB/GYN exams - paving the way for AI-enhanced medical education and clinical practice.
Front Med (Lausanne). 2023 Dec 13;10:1296615. doi: 10.3389/fmed.2023.1296615. eCollection 2023.
4
Pure Wisdom or Potemkin Villages? A Comparison of ChatGPT 3.5 and ChatGPT 4 on USMLE Step 3 Style Questions: Quantitative Analysis.
JMIR Med Educ. 2024 Jan 5;10:e51148. doi: 10.2196/51148.
5
Evaluating ChatGPT's Ability to Solve Higher-Order Questions on the Competency-Based Medical Education Curriculum in Medical Biochemistry.
Cureus. 2023 Apr 2;15(4):e37023. doi: 10.7759/cureus.37023. eCollection 2023 Apr.
6
Performance of ChatGPT as an AI-assisted decision support tool in medicine: a proof-of-concept study for interpreting symptoms and management of common cardiac conditions (AMSTELHEART-2).
Acta Cardiol. 2024 May;79(3):358-366. doi: 10.1080/00015385.2024.2303528. Epub 2024 Feb 13.
7
How does ChatGPT-4 preform on non-English national medical licensing examination? An evaluation in Chinese language.
PLOS Digit Health. 2023 Dec 1;2(12):e0000397. doi: 10.1371/journal.pdig.0000397. eCollection 2023 Dec.
8
Optimizing ChatGPT's Interpretation and Reporting of Delirium Assessment Outcomes: Exploratory Study.
JMIR Form Res. 2024 Oct 1;8:e51383. doi: 10.2196/51383.
9
Assessment of ChatGPT's performance on neurology written board examination questions.
BMJ Neurol Open. 2023 Nov 2;5(2):e000530. doi: 10.1136/bmjno-2023-000530. eCollection 2023.

Cited By

1
Comparative evaluation of large language models performance in medical education using urinary system histology assessment.
Sci Rep. 2025 Aug 29;15(1):31933. doi: 10.1038/s41598-025-17571-4.
2
Automatic- and Transformer-Based Automatic Item Generation: A Critical Review.
J Intell. 2025 Aug 12;13(8):102. doi: 10.3390/jintelligence13080102.
3
Five advanced chatbots solving European Diploma in Radiology (EDiR) text-based questions: differences in performance and consistency.
Eur Radiol Exp. 2025 Aug 19;9(1):79. doi: 10.1186/s41747-025-00591-0.
4
Addressing Commonly Asked Questions in Urogynecology: Accuracy and Limitations of ChatGPT.
Int Urogynecol J. 2025 Jun 18. doi: 10.1007/s00192-025-06184-0.
5
Large language models in oncology: a review.
BMJ Oncol. 2025 May 15;4(1):e000759. doi: 10.1136/bmjonc-2025-000759. eCollection 2025.
6
Evaluation of generative AI assistance in clinical nephrology: Assessing GPT-4, GPT-4o, Gemini 1.0 Ultra, and PaLM 2 in patient interaction and renal biopsy interpretation.
Digit Health. 2025 Jun 2;11:20552076251342067. doi: 10.1177/20552076251342067. eCollection 2025 Jan-Dec.
7
The role of artificial intelligence in predicting injured structures based on clinical images of lacerations in the volar aspect of the hand and forearm.
J Hand Microsurg. 2025 Apr 9;17(4):100255. doi: 10.1016/j.jham.2025.100255. eCollection 2025 Jul.
8
Harnessing advanced large language models in otolaryngology board examinations: an investigation using python and application programming interfaces.
Eur Arch Otorhinolaryngol. 2025 Apr 25. doi: 10.1007/s00405-025-09404-x.
9
Assessing the Quality and Reliability of ChatGPT's Responses to Radiotherapy-Related Patient Queries: Comparative Study With GPT-3.5 and GPT-4.
JMIR Cancer. 2025 Apr 16;11:e63677. doi: 10.2196/63677.
10
Analyzing Question Characteristics Influencing ChatGPT's Performance in 3000 USMLE®-Style Questions.
Med Sci Educ. 2024 Sep 28;35(1):257-267. doi: 10.1007/s40670-024-02176-9. eCollection 2025 Feb.

References

1
Utilization of ChatGPT in Medical Education: Applications and Implications for Curriculum Enhancement.
Acta Inform Med. 2023;31(4):300-305. doi: 10.5455/aim.2023.31.300-305.
2
Pure Wisdom or Potemkin Villages? A Comparison of ChatGPT 3.5 and ChatGPT 4 on USMLE Step 3 Style Questions: Quantitative Analysis.
JMIR Med Educ. 2024 Jan 5;10:e51148. doi: 10.2196/51148.
3
Using ChatGPT for Clinical Practice and Medical Education: Cross-Sectional Survey of Medical Students' and Physicians' Perceptions.
JMIR Med Educ. 2023 Dec 22;9:e50658. doi: 10.2196/50658.
4
ChatGPT's advice is perceived as better than that of professional advice columnists.
Front Psychol. 2023 Nov 21;14:1281255. doi: 10.3389/fpsyg.2023.1281255. eCollection 2023.
5
The Associations Between United States Medical Licensing Examination Performance and Outcomes of Patient Care.
Acad Med. 2024 Mar 1;99(3):325-330. doi: 10.1097/ACM.0000000000005480. Epub 2023 Oct 9.
6
Comparing ChatGPT and GPT-4 performance in USMLE soft skill assessments.
Sci Rep. 2023 Oct 1;13(1):16492. doi: 10.1038/s41598-023-43436-9.
7
Performance of ChatGPT on the Peruvian National Licensing Medical Examination: Cross-Sectional Study.
JMIR Med Educ. 2023 Sep 28;9:e48039. doi: 10.2196/48039.
8
Large Language Models in Hematology Case Solving: A Comparative Study of ChatGPT-3.5, Google Bard, and Microsoft Bing.
Cureus. 2023 Aug 21;15(8):e43861. doi: 10.7759/cureus.43861. eCollection 2023 Aug.
9
Assessing the accuracy of ChatGPT references in head and neck and ENT disciplines.
Eur Arch Otorhinolaryngol. 2023 Nov;280(11):5129-5133. doi: 10.1007/s00405-023-08205-4. Epub 2023 Sep 8.
10
Comparison of ChatGPT-3.5, ChatGPT-4, and Orthopaedic Resident Performance on Orthopaedic Assessment Examinations.
J Am Acad Orthop Surg. 2023 Dec 1;31(23):1173-1179. doi: 10.5435/JAAOS-D-23-00396. Epub 2023 Sep 4.