

Trialling a Large Language Model (ChatGPT) in General Practice With the Applied Knowledge Test: Observational Study Demonstrating Opportunities and Limitations in Primary Care.

Authors

Thirunavukarasu Arun James, Hassan Refaat, Mahmood Shathar, Sanghera Rohan, Barzangi Kara, El Mukashfi Mohanned, Shah Sachin

Affiliations

University of Cambridge School of Clinical Medicine, Cambridge, United Kingdom.

Attenborough Surgery, Bushey Medical Centre, Bushey, United Kingdom.

Publication

JMIR Med Educ. 2023 Apr 21;9:e46599. doi: 10.2196/46599.

DOI: 10.2196/46599
PMID: 37083633
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC10163403/
Abstract

BACKGROUND

Large language models exhibiting human-level performance in specialized tasks are emerging; examples include Generative Pretrained Transformer 3.5, which underlies the processing of ChatGPT. Rigorous trials are required to understand the capabilities of emerging technology, so that innovation can be directed to benefit patients and practitioners.

OBJECTIVE

Here, we evaluated the strengths and weaknesses of ChatGPT in primary care using the Membership of the Royal College of General Practitioners Applied Knowledge Test (AKT) as a medium.

METHODS

AKT questions were sourced from a web-based question bank and 2 AKT practice papers. In total, 674 unique AKT questions were inputted to ChatGPT, with the model's answers recorded and compared to correct answers provided by the Royal College of General Practitioners. Each question was inputted twice in separate ChatGPT sessions, with answers on repeated trials compared to gauge consistency. Subject difficulty was gauged by referring to examiners' reports from 2018 to 2022. Novel explanations from ChatGPT-defined as information provided that was not inputted within the question or multiple answer choices-were recorded. Performance was analyzed with respect to subject, difficulty, question source, and novel model outputs to explore ChatGPT's strengths and weaknesses.
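The repeated-trial design described above can be sketched in code. The following is a minimal illustration on hypothetical toy records (the study does not publish its analysis scripts; `QuestionResult`, `accuracy`, and `consistency` are names invented here): each question is posed twice in independent sessions, and we compute per-trial accuracy against the official key plus between-trial answer consistency.

```python
from dataclasses import dataclass

@dataclass
class QuestionResult:
    """One AKT question: the official key and the model's two answers."""
    correct: str   # answer key from the RCGP
    trial_1: str   # model answer, first session
    trial_2: str   # model answer, second (independent) session

def accuracy(results, trial):
    """Fraction of questions answered correctly on the given trial (1 or 2)."""
    answers = [getattr(r, f"trial_{trial}") for r in results]
    return sum(a == r.correct for a, r in zip(answers, results)) / len(results)

def consistency(results):
    """Fraction of questions where both trials gave the same answer."""
    return sum(r.trial_1 == r.trial_2 for r in results) / len(results)

# Hypothetical toy data, not taken from the study
results = [
    QuestionResult("A", "A", "A"),
    QuestionResult("B", "C", "B"),
    QuestionResult("D", "D", "C"),
    QuestionResult("A", "A", "A"),
]
print(accuracy(results, 1))  # 0.75
print(consistency(results))  # 0.5
```

In the study itself the two trials were also compared per source and per subject category; the same pattern extends by grouping `results` before scoring.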

RESULTS

Average overall performance of ChatGPT was 60.17%, which is below the mean passing mark in the last 2 years (70.42%). Accuracy differed between sources (P=.04 and .06). ChatGPT's performance varied with subject category (P=.02 and .02), but variation did not correlate with difficulty (Spearman ρ=-0.241 and -0.238; P=.19 and .20). The proclivity of ChatGPT to provide novel explanations did not affect accuracy (P>.99 and .23).
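The difficulty analysis above uses Spearman's rank correlation. As a worked illustration of the statistic (pure Python, no external libraries; the input values below are invented, not the study's data), Spearman's ρ is simply the Pearson correlation computed on the ranks of the two variables:

```python
def rank(values):
    """1-based ranks; ties receive the mean of their positions."""
    sorted_vals = sorted(values)
    return [sum(i + 1 for i, v in enumerate(sorted_vals) if v == x)
            / sorted_vals.count(x)
            for x in values]

def spearman_rho(x, y):
    """Spearman's rho = Pearson correlation of the rank vectors."""
    rx, ry = rank(x), rank(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Illustrative only: per-subject accuracy (%) vs. examiner difficulty rank
rho = spearman_rho([62, 55, 70, 48, 66], [3, 4, 2, 5, 1])
print(rho)  # ≈ -0.9
```

A ρ near ±1 indicates a monotone relationship between accuracy and difficulty; the study's observed values (ρ = -0.241 and -0.238, P = .19 and .20) indicate no significant correlation.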

CONCLUSIONS

Large language models are approaching human expert-level performance, although further development is required to match the performance of qualified primary care physicians in the AKT. Validated high-performance models may serve as assistants or autonomous clinical tools to ameliorate the general practice workforce crisis.


Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/869d/10163403/d6d4c528c252/mededu_v9i1e46599_fig1.jpg
Figure 2: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/869d/10163403/f37e63cba8fa/mededu_v9i1e46599_fig2.jpg
Figure 3: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/869d/10163403/5e1b3186e05a/mededu_v9i1e46599_fig3.jpg
Figure 4: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/869d/10163403/690e2fcf9750/mededu_v9i1e46599_fig4.jpg

Similar Articles

1. Trialling a Large Language Model (ChatGPT) in General Practice With the Applied Knowledge Test: Observational Study Demonstrating Opportunities and Limitations in Primary Care. JMIR Med Educ. 2023 Apr 21;9:e46599. doi: 10.2196/46599.
2. Assessing question characteristic influences on ChatGPT's performance and response-explanation consistency: Insights from Taiwan's Nursing Licensing Exam. Int J Nurs Stud. 2024 May;153:104717. doi: 10.1016/j.ijnurstu.2024.104717.
3. How Does ChatGPT Perform on the United States Medical Licensing Examination (USMLE)? The Implications of Large Language Models for Medical Education and Knowledge Assessment. JMIR Med Educ. 2023 Feb 8;9:e45312. doi: 10.2196/45312.
4. ChatGPT's performance in German OB/GYN exams - paving the way for AI-enhanced medical education and clinical practice. Front Med (Lausanne). 2023 Dec 13;10:1296615. doi: 10.3389/fmed.2023.1296615.
5. Performance of ChatGPT on the Chinese Postgraduate Examination for Clinical Medicine: Survey Study. JMIR Med Educ. 2024 Feb 9;10:e48514. doi: 10.2196/48514.
6. Unveiling the ChatGPT phenomenon: Evaluating the consistency and accuracy of endodontic question answers. Int Endod J. 2024 Jan;57(1):108-113. doi: 10.1111/iej.13985.
7. Assessing the Utility of ChatGPT Throughout the Entire Clinical Workflow: Development and Usability Study. J Med Internet Res. 2023 Aug 22;25:e48659. doi: 10.2196/48659.
8. The Accuracy of Artificial Intelligence ChatGPT in Oncology Examination Questions. J Am Coll Radiol. 2024 Nov;21(11):1800-1804. doi: 10.1016/j.jacr.2024.07.011.
9. Evaluating ChatGPT's effectiveness and tendencies in Japanese internal medicine. J Eval Clin Pract. 2024 Sep;30(6):1017-1023. doi: 10.1111/jep.14011.
10. Evaluating the Performance of ChatGPT in Ophthalmology: An Analysis of Its Successes and Shortcomings. Ophthalmol Sci. 2023 May 5;3(4):100324. doi: 10.1016/j.xops.2023.100324.

Cited By

1. Current trends and future prospects of language models and processing systems in spine surgery - a scoping review. Neurosurg Rev. 2025 Sep 5;48(1):633. doi: 10.1007/s10143-025-03785-7.
2. Promoting trust and intention to adopt health information generated by ChatGPT among healthcare customers: An empirical study. Digit Health. 2025 Aug 28;11:20552076251374121. doi: 10.1177/20552076251374121.
3. Token Probabilities to Mitigate Large Language Models Overconfidence in Answering Medical Questions: Quantitative Study. J Med Internet Res. 2025 Aug 29;27:e64348. doi: 10.2196/64348.
4. Comprehensive application of artificial intelligence in colorectal cancer: A review. iScience. 2025 Jun 23;28(7):112980. doi: 10.1016/j.isci.2025.112980.
5. Medical Students' Perceptions of Large Language Models in Healthcare: A Multinational Cross-Sectional Study. J Med Educ Curric Dev. 2025 May 21;12:23821205251331124. doi: 10.1177/23821205251331124.
6. Evaluating the Potential of ChatGPT as a Supplementary Intelligent Virtual Assistant in Periodontology. J Pharm Bioallied Sci. 2025 Jun;17(Suppl 2):S1415-S1417. doi: 10.4103/jpbs.jpbs_1727_24.
7. ChatGPT versus DeepSeek in head and neck cancer staging and treatment planning: guideline-based study. Eur Arch Otorhinolaryngol. 2025 Jun 17. doi: 10.1007/s00405-025-09524-4.
8. Opportunities, challenges, and requirements for Artificial Intelligence (AI) implementation in Primary Health Care (PHC): a systematic review. BMC Prim Care. 2025 Jun 9;26(1):196. doi: 10.1186/s12875-025-02785-2.
9. Comparison of hand surgery certification exams in Europe and the United States using ChatGPT 4.0. J Hand Microsurg. 2025 May 5;17(4):100258. doi: 10.1016/j.jham.2025.100258.
10. A Primer on Large Language Models (LLMs) and ChatGPT for Cardiovascular Healthcare Professionals. CJC Open. 2025 Feb 20;7(5):660-666. doi: 10.1016/j.cjco.2025.02.012.

References

1. The diagnostic and triage accuracy of the GPT-3 artificial intelligence model: an observational study. Lancet Digit Health. 2024 Aug;6(8):e555-e561. doi: 10.1016/S2589-7500(24)00097-9.
2. Putting ChatGPT's Medical Advice to the (Turing) Test: Survey Study. JMIR Med Educ. 2023 Jul 10;9:e46939. doi: 10.2196/46939.
3. Accuracy and reliability of self-administered visual acuity tests: Systematic review of pragmatic trials. PLoS One. 2023 Jun 22;18(6):e0281847. doi: 10.1371/journal.pone.0281847.
4. Performance of ChatGPT on USMLE: Potential for AI-assisted medical education using large language models. PLOS Digit Health. 2023 Feb 9;2(2):e0000198. doi: 10.1371/journal.pdig.0000198.
5. How Does ChatGPT Perform on the United States Medical Licensing Examination (USMLE)? The Implications of Large Language Models for Medical Education and Knowledge Assessment. JMIR Med Educ. 2023 Feb 8;9:e45312. doi: 10.2196/45312.
6. Workload and workflow implications associated with the use of electronic clinical decision support tools used by health professionals in general practice: a scoping review. BMC Prim Care. 2023 Jan 20;24(1):23. doi: 10.1186/s12875-023-01973-2.
7. Examining Mental Workload Relating to Digital Health Technologies in Health Care: Systematic Review. J Med Internet Res. 2022 Oct 28;24(10):e40946. doi: 10.2196/40946.
8. General Practice in England: The Current Crisis, Opportunities, and Challenges. J Ambul Care Manage. 2022;45(2):135-139. doi: 10.1097/JAC.0000000000000410.
9. Health-focused conversational agents in person-centered care: a review of apps. NPJ Digit Med. 2022 Feb 17;5(1):21. doi: 10.1038/s41746-022-00560-6.
10. General Practitioners' Attitudes Toward Artificial Intelligence-Enabled Systems: Interview Study. J Med Internet Res. 2022 Jan 27;24(1):e28916. doi: 10.2196/28916.