

Comparative Analysis of Performance of Large Language Models in Urogynecology.

Author Information

Yadav Ghanshyam S, Pandit Kshitij, Connell Phillip T, Erfani Hadi, Nager Charles W

Affiliations

From the Division of Urogynecology and Reconstructive Pelvic Surgery, UC San Diego, San Diego, CA.

Department of Urology, UC San Diego School of Medicine, La Jolla, CA.

Publication Information

Urogynecology (Phila). 2025 Jul 1;31(7):713-719. doi: 10.1097/SPV.0000000000001545. Epub 2024 Jun 27.

Abstract

IMPORTANCE

Despite the growing popularity of large language models in medicine, data on their performance in urogynecology are lacking.

OBJECTIVE

The aim of this study was to compare the performance of ChatGPT-3.5, GPT-4, and Bard on the American Urogynecologic Society self-assessment examination.

STUDY DESIGN

The examination features 185 questions with a passing score of 80. We tested 3 models (ChatGPT-3.5, GPT-4, and Bard) on every question. Dedicated accounts enabled controlled comparisons. Questions with prompts were entered into each model's interface, and responses were evaluated for correctness, the logical reasoning behind the answer choice, and sourcing. Data on subcategory, question type, correctness rate, question difficulty, and reference quality were recorded. The Fisher exact test or χ² test was used for statistical analysis.
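As a minimal illustration of the kind of comparison described above (not the authors' analysis code), correctness for any two models can be tallied into a 2×2 contingency table of correct versus incorrect answers and tested with scipy.stats; the counts below are placeholders, not study data.

# Sketch of the Fisher exact / chi-square comparison described above.
# The 2x2 table uses hypothetical counts: rows = two models, columns = (correct, incorrect).
from scipy.stats import chi2_contingency, fisher_exact

table = [[110, 75],
         [80, 105]]

# Chi-square test of independence (appropriate when expected cell counts are not small)
chi2, p_chi2, dof, expected = chi2_contingency(table)

# Fisher exact test (often preferred for small expected counts)
odds_ratio, p_fisher = fisher_exact(table)

print(f"chi-square p = {p_chi2:.4f}; Fisher exact p = {p_fisher:.4f}")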

RESULTS

Out of 185 questions, GPT-4 answered 61.6% of questions correctly, compared with 54.6% for GPT-3.5 and 42.7% for Bard. GPT-4 answered all questions, whereas GPT-3.5 and Bard declined to answer 4 and 25 questions, respectively. All models demonstrated logical reasoning in their correct responses. Performance of all large language models was inversely proportional to the difficulty level of the questions. Bard referenced sources 97.5% of the time, more often than GPT-4 (83.3%) and GPT-3.5 (39%). GPT-3.5 cited books and websites, whereas GPT-4 and Bard additionally cited journal articles and society guidelines. The median journal impact factor and number of citations were 3.6 and 20 for GPT-4 and 2.6 and 25 for Bard.
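For orientation, the reported percentages can be converted back into approximate raw counts; this is a rough reconstruction that assumes the percentages are computed over all 185 questions (the abstract does not state how declined questions were scored).

# Approximate correct-answer counts implied by the reported percentages (assumption noted above).
total = 185
rates = {"GPT-4": 0.616, "GPT-3.5": 0.546, "Bard": 0.427}
declined = {"GPT-4": 0, "GPT-3.5": 4, "Bard": 25}

for model, rate in rates.items():
    correct = round(rate * total)        # e.g., 0.616 * 185 ≈ 114 for GPT-4
    attempted = total - declined[model]  # questions the model actually answered
    print(f"{model}: ~{correct}/{total} correct ({rate:.1%}); answered {attempted} of {total}")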

CONCLUSIONS

Although GPT-4 outperformed GPT-3.5 and Bard, none of the models achieved a passing score. Clinicians should use language models cautiously in patient care scenarios until more evidence emerges.

