

Performance of large language models on benign prostatic hyperplasia frequently asked questions.

Author Affiliations

Department of Ultrasound, Ruijin Hospital, Shanghai Jiaotong University School of Medicine, Shanghai, China.

College of Health Science and Technology, Shanghai Jiao Tong University School of Medicine, Shanghai, China.

Publication Information

Prostate. 2024 Jun;84(9):807-813. doi: 10.1002/pros.24699. Epub 2024 Apr 1.


DOI: 10.1002/pros.24699
PMID: 38558009
Abstract

BACKGROUND: Benign prostatic hyperplasia (BPH) is a common condition, yet it is challenging for the average BPH patient to find credible and accurate information about it. Our goal was to evaluate and compare the accuracy and reproducibility of large language models (LLMs), including ChatGPT-3.5, ChatGPT-4, and New Bing Chat, in responding to a BPH frequently asked questions (FAQ) questionnaire.

METHODS: A total of 45 BPH-related questions were categorized into basic and professional knowledge. Three LLMs (ChatGPT-3.5, ChatGPT-4, and New Bing Chat) were used to generate responses to these questions. Responses were graded as comprehensive, correct but inadequate, mixed with incorrect/outdated data, or completely incorrect. Reproducibility was assessed by generating two responses to each question. All responses were reviewed and judged by experienced urologists.

RESULTS: All three LLMs exhibited high accuracy in their responses, with accuracy rates ranging from 86.7% to 100%. However, there was no statistically significant difference in response accuracy among the three models (p > 0.017 for all pairwise comparisons). Additionally, accuracy on the basic knowledge questions was roughly equivalent to accuracy on the professional knowledge questions, with a difference of less than 3.5% (GPT-3.5: 90% vs. 86.7%; GPT-4: 96.7% vs. 95.6%; New Bing: 96.7% vs. 93.3%). Furthermore, all three LLMs demonstrated high reproducibility, with rates ranging from 93.3% to 97.8%.

CONCLUSIONS: ChatGPT-3.5, ChatGPT-4, and New Bing Chat offer accurate and reproducible responses to BPH-related questions, establishing them as valuable resources for enhancing health literacy and supporting BPH patients in conjunction with healthcare professionals.
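The evaluation protocol in the abstract (four-level grading of responses, accuracy as the share of acceptable grades, and a Bonferroni-corrected significance threshold for the three pairwise model comparisons) can be sketched as follows. The grade labels and sample data here are hypothetical illustrations; the abstract does not state which grade levels counted toward the accuracy rate, so treating "comprehensive" and "correct but inadequate" as accurate is an assumption.

```python
# Hypothetical sketch of the grading/accuracy scheme described in the abstract.
# Assumption: a response counts as accurate if graded "comprehensive" or
# "correct but inadequate" (the paper's accuracy criterion is not given here).
ACCURATE = {"comprehensive", "correct_but_inadequate"}

def accuracy(grades):
    """Fraction of graded responses that count as accurate."""
    return sum(g in ACCURATE for g in grades) / len(grades)

# Hypothetical grades for the 45 questions (illustration only).
grades = (["comprehensive"] * 40
          + ["correct_but_inadequate"] * 3
          + ["mixed_incorrect_outdated"] * 2)
print(round(accuracy(grades), 3))  # 43/45 -> 0.956

# Three models yield three pairwise comparisons; dividing the usual alpha of
# 0.05 by 3 (Bonferroni correction) gives the p > 0.017 criterion reported.
n_comparisons = 3
alpha_corrected = 0.05 / n_comparisons
print(round(alpha_corrected, 3))  # -> 0.017
```

This also makes explicit why the abstract reports "p > 0.017" rather than the conventional "p > 0.05": with three pairwise comparisons, each individual test is held to the stricter Bonferroni-adjusted threshold.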


Similar Articles

[1]
Performance of large language models on benign prostatic hyperplasia frequently asked questions.

Prostate. 2024-6

[2]
Evaluation of the Performance of Generative AI Large Language Models ChatGPT, Google Bard, and Microsoft Bing Chat in Supporting Evidence-Based Dentistry: Comparative Mixed Methods Study.

J Med Internet Res. 2023-12-28

[3]
Evidence-based potential of generative artificial intelligence large language models in orthodontics: a comparative study of ChatGPT, Google Bard, and Microsoft Bing.

Eur J Orthod. 2024-4-13

[4]
Dr. Google vs. Dr. ChatGPT: Exploring the Use of Artificial Intelligence in Ophthalmology by Comparing the Accuracy, Safety, and Readability of Responses to Frequently Asked Patient Questions Regarding Cataracts and Cataract Surgery.

Semin Ophthalmol. 2024-8

[5]
Clinical application potential of large language model: a study based on thyroid nodules.

Endocrine. 2025-1

[6]
Large language models and bariatric surgery patient education: a comparative readability analysis of GPT-3.5, GPT-4, Bard, and online institutional resources.

Surg Endosc. 2024-5

[7]
Assessing the Application of Large Language Models in Generating Dermatologic Patient Education Materials According to Reading Level: Qualitative Study.

JMIR Dermatol. 2024-5-16

[8]
Assessing the accuracy and completeness of artificial intelligence language models in providing information on methotrexate use.

Rheumatol Int. 2024-3

[9]
Harnessing artificial intelligence in bariatric surgery: comparative analysis of ChatGPT-4, Bing, and Bard in generating clinician-level bariatric surgery recommendations.

Surg Obes Relat Dis. 2024-7

[10]
Performance of artificial intelligence in bariatric surgery: comparative analysis of ChatGPT-4, Bing, and Bard in the American Society for Metabolic and Bariatric Surgery textbook of bariatric surgery questions.

Surg Obes Relat Dis. 2024-7

Cited By

[1]
Systematic Review on Large Language Models in Orthopaedic Surgery.

J Clin Med. 2025-8-20

[2]
Artificial Intelligence for Individualized Radiological Dialogue: The Impact of RadioBot on Precision-Driven Medical Practices.

J Pers Med. 2025-8-8

[3]
Accuracy of Large Language Models When Answering Clinical Research Questions: Systematic Review and Network Meta-Analysis.

J Med Internet Res. 2025-4-30

[4]
Performance of ChatGPT and Microsoft Copilot in Bing in answering obstetric ultrasound questions and analyzing obstetric ultrasound reports.

Sci Rep. 2025-4-26

[5]
Large language models in patient education: a scoping review of applications in medicine.

Front Med (Lausanne). 2024-10-29

[6]
Use of artificial intelligence chatbots in clinical management of immune-related adverse events.

J Immunother Cancer. 2024-5-30
