
Assessing the Accuracy and Reliability of AI-Generated Medical Responses: An Evaluation of the Chat-GPT Model

Authors

Johnson Douglas, Goodman Rachel, Patrinely J, Stone Cosby, Zimmerman Eli, Donald Rebecca, Chang Sam, Berkowitz Sean, Finn Avni, Jahangir Eiman, Scoville Elizabeth, Reese Tyler, Friedman Debra, Bastarache Julie, van der Heijden Yuri, Wright Jordan, Carter Nicholas, Alexander Matthew, Choe Jennifer, Chastain Cody, Zic John, Horst Sara, Turker Isik, Agarwal Rajiv, Osmundson Evan, Idrees Kamran, Kieman Colleen, Padmanabhan Chandrasekhar, Bailey Christina, Schlegel Cameron, Chambless Lola, Gibson Mike, Osterman Travis, Wheless Lee

Affiliations

Vanderbilt University Medical Center.

Vanderbilt University School of Medicine.

Publication

Res Sq. 2023 Feb 28:rs.3.rs-2566942. doi: 10.21203/rs.3.rs-2566942/v1.

DOI:10.21203/rs.3.rs-2566942/v1
PMID:36909565
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC10002821/
Abstract

BACKGROUND

Natural language processing models such as ChatGPT can generate text-based content and are poised to become a major information source in medicine and beyond. The accuracy and completeness of ChatGPT for medical queries is not known.

METHODS

Thirty-three physicians across 17 specialties generated 284 medical questions that they subjectively classified as easy, medium, or hard with either binary (yes/no) or descriptive answers. The physicians then graded ChatGPT-generated answers to these questions for accuracy (6-point Likert scale; range 1 - completely incorrect to 6 - completely correct) and completeness (3-point Likert scale; range 1 - incomplete to 3 - complete plus additional context). Scores were summarized with descriptive statistics and compared using Mann-Whitney U or Kruskal-Wallis testing.
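The two-group comparison described above can be sketched with a count-based Mann-Whitney U statistic. The implementation below and the Likert scores in it are purely illustrative assumptions for exposition, not the study's data or analysis code.

```python
# Minimal sketch: Mann-Whitney U for comparing two groups of ordinal
# Likert scores (e.g. accuracy of binary vs. descriptive questions).

def mann_whitney_u(xs, ys):
    """Count-based U: each pair (x, y) contributes 1 if x > y,
    0.5 if tied, 0 otherwise. Ties are common with Likert data."""
    u = 0.0
    for x in xs:
        for y in ys:
            if x > y:
                u += 1.0
            elif x == y:
                u += 0.5
    return u

# Hypothetical 6-point accuracy scores for two question types.
binary_scores = [6, 6, 5, 6, 4, 5, 6, 3]
descriptive_scores = [5, 4, 6, 5, 3, 5, 4, 2]

u1 = mann_whitney_u(binary_scores, descriptive_scores)
u2 = mann_whitney_u(descriptive_scores, binary_scores)
# The two U values always sum to len(xs) * len(ys).
print(u1, u2)  # -> 45.5 18.5
```

In practice one would use `scipy.stats.mannwhitneyu` (and `scipy.stats.kruskal` for the three difficulty levels), which also provides the p-values reported in the Results.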

RESULTS

Across all questions (n=284), median accuracy score was 5.5 (between almost completely and completely correct) with mean score of 4.8 (between mostly and almost completely correct). Median completeness score was 3 (complete and comprehensive) with mean score of 2.5. For questions rated easy, medium, and hard, median accuracy scores were 6, 5.5, and 5 (mean 5.0, 4.7, and 4.6; p=0.05). Accuracy scores for binary and descriptive questions were similar (median 6 vs. 5; mean 4.9 vs. 4.7; p=0.07). Of 36 questions with scores of 1-2, 34 were re-queried/re-graded 8-17 days later with substantial improvement (median 2 vs. 4; p<0.01).

CONCLUSIONS

ChatGPT generated largely accurate information in response to diverse medical queries as judged by academic physician specialists, although with important limitations. Further research and model development are needed to correct inaccuracies and for validation.


Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/d86d/10002821/282b5a97befa/nihpp-rs2566942v1-f0001.jpg
Figure 2: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/d86d/10002821/7ec9478a6b1a/nihpp-rs2566942v1-f0002.jpg

Similar articles

1
Assessing the Accuracy and Reliability of AI-Generated Medical Responses: An Evaluation of the Chat-GPT Model.
Res Sq. 2023 Feb 28:rs.3.rs-2566942. doi: 10.21203/rs.3.rs-2566942/v1.
2
Accuracy and Reliability of Chatbot Responses to Physician Questions.
JAMA Netw Open. 2023 Oct 2;6(10):e2336483. doi: 10.1001/jamanetworkopen.2023.36483.
3
Assessing the Accuracy of Information on Medication Abortion: A Comparative Analysis of ChatGPT and Google Bard AI.
Cureus. 2024 Jan 2;16(1):e51544. doi: 10.7759/cureus.51544. eCollection 2024 Jan.
4
Assessing the Accuracy, Completeness, and Reliability of Artificial Intelligence-Generated Responses in Dentistry: A Pilot Study Evaluating the ChatGPT Model.
Cureus. 2024 Jul 29;16(7):e65658. doi: 10.7759/cureus.65658. eCollection 2024 Jul.
5
Is ChatGPT accurate and reliable in answering questions regarding head and neck cancer?
Front Oncol. 2023 Dec 1;13:1256459. doi: 10.3389/fonc.2023.1256459. eCollection 2023.
6
Investigating the Accuracy and Completeness of an Artificial Intelligence Large Language Model About Uveitis: An Evaluation of ChatGPT.
Ocul Immunol Inflamm. 2024 Nov;32(9):2052-2055. doi: 10.1080/09273948.2024.2317417. Epub 2024 Feb 23.
7
Accuracy of Information given by ChatGPT for Patients with Inflammatory Bowel Disease in Relation to ECCO Guidelines.
J Crohns Colitis. 2024 Aug 14;18(8):1215-1221. doi: 10.1093/ecco-jcc/jjae040.
8
Accuracy of ChatGPT on Medical Questions in the National Medical Licensing Examination in Japan: Evaluation Study.
JMIR Form Res. 2023 Oct 13;7:e48023. doi: 10.2196/48023.
9
Performance of ChatGPT on the Chinese Postgraduate Examination for Clinical Medicine: Survey Study.
JMIR Med Educ. 2024 Feb 9;10:e48514. doi: 10.2196/48514.
10
Assessing the Capability of ChatGPT in Answering First- and Second-Order Knowledge Questions on Microbiology as per Competency-Based Medical Education Curriculum.
Cureus. 2023 Mar 12;15(3):e36034. doi: 10.7759/cureus.36034. eCollection 2023 Mar.

Cited by

1
Battle of the artificial intelligence: a comprehensive comparative analysis of DeepSeek and ChatGPT for urinary incontinence-related questions.
Front Public Health. 2025 Jul 23;13:1605908. doi: 10.3389/fpubh.2025.1605908. eCollection 2025.
2
Users' Needs for Mental Health Apps: Quality Evaluation Using the User Version of the Mobile Application Rating Scale.
JMIR Mhealth Uhealth. 2025 Jul 4;13:e64622. doi: 10.2196/64622.
3
The Diagnostic Performance of Large Language Models and Oral Medicine Consultants for Identifying Oral Lesions in Text-Based Clinical Scenarios: Prospective Comparative Study.
JMIR AI. 2025 Apr 24;4:e70566. doi: 10.2196/70566.
4
Benchmarking the Confidence of Large Language Models in Answering Clinical Questions: Cross-Sectional Evaluation Study.
JMIR Med Inform. 2025 May 16;13:e66917. doi: 10.2196/66917.
5
Evaluating Generative AI in Mental Health: Systematic Review of Capabilities and Limitations.
JMIR Ment Health. 2025 May 15;12:e70014. doi: 10.2196/70014.
6
Transforming dental diagnostics with artificial intelligence: advanced integration of ChatGPT and large language models for patient care.
Front Dent Med. 2025 Jan 6;5:1456208. doi: 10.3389/fdmed.2024.1456208. eCollection 2024.
7
The Goldilocks Zone: Finding the right balance of user and institutional risk for suicide-related generative AI queries.
PLOS Digit Health. 2025 Jan 8;4(1):e0000711. doi: 10.1371/journal.pdig.0000711. eCollection 2025 Jan.
8
Large language model answers medical questions about standard pathology reports.
Front Med (Lausanne). 2024 Sep 18;11:1402457. doi: 10.3389/fmed.2024.1402457. eCollection 2024.
9
Perceptions of ChatGPT in healthcare: usefulness, trust, and risk.
Front Public Health. 2024 Sep 13;12:1457131. doi: 10.3389/fpubh.2024.1457131. eCollection 2024.
10
Impact of Large Language Models on Medical Education and Teaching Adaptations.
JMIR Med Inform. 2024 Jul 25;12:e55933. doi: 10.2196/55933.