

Evaluating the Performance of Different Large Language Models on Health Consultation and Patient Education in Urolithiasis.

Author Affiliations

Department of Urology, Beijing Tsinghua Changgung Hospital, School of Clinical Medicine, Tsinghua University, 168 Litang Rd, Beijing, 102218, China.

Institute of Urology, School of Clinical Medicine, Tsinghua University, Beijing, 102218, China.

Publication Information

J Med Syst. 2023 Nov 24;47(1):125. doi: 10.1007/s10916-023-02021-3.

Abstract

OBJECTIVES

To evaluate the effectiveness of four large language models (LLMs) (Claude, Bard, ChatGPT4, and New Bing) that have large user bases and significant social attention, in the context of medical consultation and patient education in urolithiasis.

MATERIALS AND METHODS

In this study, we developed a questionnaire consisting of 21 questions and 2 clinical scenarios related to urolithiasis. Subsequently, clinical consultations were simulated for each of the four models to assess their responses to the questions. Urolithiasis experts then evaluated the model responses in terms of accuracy, comprehensiveness, ease of understanding, human care, and clinical case analysis ability based on a predesigned 5-point Likert scale. Visualization and statistical analyses were then employed to compare the four models and evaluate their performance.
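The scoring procedure above — expert ratings on a 5-point Likert scale per dimension, averaged and compared across models — can be sketched as follows. This is a minimal illustration with made-up ratings; the model names match the study, but the scores and the two example dimensions are hypothetical, not the study's data.

```python
from statistics import mean

# Hypothetical expert ratings on a 5-point Likert scale, per model and
# per evaluated dimension (illustrative values only, not study data).
ratings = {
    "Claude":   {"accuracy": [5, 5, 4], "comprehensiveness": [5, 4, 5]},
    "ChatGPT4": {"accuracy": [4, 5, 4], "comprehensiveness": [4, 4, 4]},
    "Bard":     {"accuracy": [3, 3, 4], "comprehensiveness": [3, 4, 3]},
    "New Bing": {"accuracy": [4, 4, 3], "comprehensiveness": [4, 3, 4]},
}

def dimension_means(ratings):
    """Average each model's expert scores within every dimension."""
    return {
        model: {dim: round(mean(scores), 2) for dim, scores in dims.items()}
        for model, dims in ratings.items()
    }

summary = dimension_means(ratings)

# Rank models by mean accuracy, highest first.
ranked = sorted(summary, key=lambda m: summary[m]["accuracy"], reverse=True)
print(ranked)  # with these illustrative scores: Claude first, Bard last
```

In the actual study, such per-dimension means would feed into visualization and statistical comparison across the four models rather than a simple sort.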

RESULTS

All models yielded satisfactory performance, except for Bard, which failed to provide a valid response to Question 13. Claude consistently scored the highest in all dimensions compared with the other three models. ChatGPT4 ranked second in accuracy, with relatively stable output across multiple tests, but showed shortcomings in empathy and human caring. Bard exhibited the lowest accuracy and overall performance. Claude and ChatGPT4 both demonstrated a high capacity to analyze clinical cases of urolithiasis. Overall, Claude emerged as the best performer in urolithiasis consultations and education.

CONCLUSION

Claude demonstrated superior performance compared with the other three in urolithiasis consultation and education. This study highlights the remarkable potential of LLMs in medical health consultations and patient education, although professional review, further evaluation, and modifications are still required.

