Department of Ultrasound, Ruijin Hospital, Shanghai Jiao Tong University School of Medicine, Shanghai, China.
College of Health Science and Technology, Shanghai Jiao Tong University School of Medicine, Shanghai, China.
Prostate. 2024 Jun;84(9):807-813. doi: 10.1002/pros.24699. Epub 2024 Apr 1.
BACKGROUND: Benign prostatic hyperplasia (BPH) is a common condition, yet it is challenging for the average BPH patient to find credible and accurate information about it. Our goal was to evaluate and compare the accuracy and reproducibility of large language models (LLMs), including ChatGPT-3.5, ChatGPT-4, and New Bing Chat, in responding to a BPH frequently asked questions (FAQ) questionnaire.
METHODS: A total of 45 questions related to BPH were categorized into basic and professional knowledge. Three LLMs (ChatGPT-3.5, ChatGPT-4, and New Bing Chat) were used to generate responses to these questions. Responses were graded as comprehensive, correct but inadequate, mixed with incorrect/outdated data, or completely incorrect. Reproducibility was assessed by generating two responses for each question. All responses were reviewed and judged by experienced urologists.
RESULTS: All three LLMs exhibited high accuracy in their responses, with accuracy rates ranging from 86.7% to 100%. There was no statistically significant difference in response accuracy among the three models (p > 0.017 for all pairwise comparisons). The accuracy of the LLMs' responses to the basic knowledge questions was roughly equivalent to that for the professional knowledge questions, with a difference of less than 3.5% (GPT-3.5: 90% vs. 86.7%; GPT-4: 96.7% vs. 95.6%; New Bing: 96.7% vs. 93.3%). Furthermore, all three LLMs demonstrated high reproducibility, with rates ranging from 93.3% to 97.8%.
CONCLUSIONS: ChatGPT-3.5, ChatGPT-4, and New Bing Chat offer accurate and reproducible responses to BPH-related questions, establishing them as valuable resources for enhancing health literacy and supporting BPH patients in conjunction with healthcare professionals.
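The abstract does not name the statistical test behind the p > 0.017 criterion, but with three models there are three pairwise comparisons, so 0.017 is consistent with a Bonferroni-adjusted threshold of 0.05/3 ≈ 0.017. The Python sketch below reconstructs the analysis under stated assumptions: Fisher's exact test on 2x2 correct/incorrect tables stands in for the paper's unnamed test, and the per-model correct-answer counts and per-run grade lists are hypothetical, chosen only to fall in the reported accuracy range.

```python
# A minimal sketch of the pairwise comparison implied by the abstract.
# Assumptions (not from the paper): Fisher's exact test on 2x2 tables,
# and hypothetical correct counts in the reported 86.7%-100% range.
from itertools import combinations
from scipy.stats import fisher_exact

N_QUESTIONS = 45  # 45 BPH FAQ items, per the abstract

# Hypothetical numbers of correctly answered questions out of 45.
correct = {"GPT-3.5": 40, "GPT-4": 43, "New Bing": 43}

# Three pairwise comparisons -> Bonferroni-adjusted threshold
# 0.05 / 3 ~= 0.017, matching the abstract's "p > 0.017" criterion.
pairs = list(combinations(correct, 2))
alpha = 0.05 / len(pairs)

for a, b in pairs:
    table = [[correct[a], N_QUESTIONS - correct[a]],
             [correct[b], N_QUESTIONS - correct[b]]]
    _, p = fisher_exact(table)
    verdict = "significant" if p < alpha else "not significant"
    print(f"{a} vs {b}: p = {p:.3f} -> {verdict} at alpha = {alpha:.3f}")

# Reproducibility as defined in the abstract: each question is asked
# twice, and the rate is the fraction of questions whose two responses
# received the same grade. Grade lists here are illustrative only.
run1 = ["comprehensive"] * 43 + ["correct but inadequate"] * 2
run2 = ["comprehensive"] * 42 + ["correct but inadequate"] * 3
agreement = sum(g1 == g2 for g1, g2 in zip(run1, run2)) / N_QUESTIONS
print(f"reproducibility = {agreement:.1%}")  # 44/45 agree -> 97.8%
```

A chi-square test would be an equally plausible reconstruction of the comparison; the point of the sketch is the Bonferroni arithmetic behind the 0.017 threshold and the grading-agreement definition of reproducibility, both of which follow directly from the abstract.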