Assessing the accuracy and completeness of artificial intelligence language models in providing information on methotrexate use.

Affiliations

Division of Rheumatology, Department of Internal Medicine, Faculty of Medicine, Bursa Uludag University, Bursa, Turkey.

Department of Biostatistics, Faculty of Medicine, Bursa Uludag University, Bursa, Turkey.

Publication information

Rheumatol Int. 2024 Mar;44(3):509-515. doi: 10.1007/s00296-023-05473-5. Epub 2023 Sep 25.

Abstract

We aimed to assess the accuracy and completeness of Large Language Models (LLMs), namely ChatGPT-3.5, ChatGPT-4, BARD, and Bing, in answering Methotrexate (MTX)-related questions about its use in treating rheumatoid arthritis. We employed 23 questions about MTX concerns drawn from an earlier study. These questions were entered into the LLMs, and the responses generated by each model were rated by two reviewers on Likert scales for accuracy and completeness. The GPT models achieved a 100% correct-answer rate, while BARD and Bing each scored 73.91%. In terms of accuracy of the outputs (completely correct responses), GPT-4 achieved a score of 100%, GPT-3.5 scored 86.96%, and BARD and Bing each scored 60.87%. BARD produced 17.39% incorrect responses and 8.7% non-responses, while Bing recorded 13.04% incorrect responses and 13.04% non-responses. The ChatGPT models produced significantly more accurate responses than Bing in the "mechanism of action" category, and the GPT-4 model showed significantly higher accuracy than BARD in the "side effects" category. There were no statistically significant differences among the models in the "lifestyle" category. For completeness, GPT-4 achieved a fully comprehensive output rate of 100%, followed by GPT-3.5 at 86.96%, BARD at 60.86%, and Bing at 0%. In the "mechanism of action" category, both ChatGPT models and BARD produced significantly more comprehensive outputs than Bing. In the "side effects" and "lifestyle" categories, the ChatGPT models showed significantly higher completeness than Bing. The GPT models, particularly GPT-4, demonstrated superior performance in providing accurate and comprehensive patient information about MTX use. However, the study also identified inaccuracies and shortcomings in the generated responses.
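Because every model answered the same 23 questions, each percentage in the abstract corresponds to a whole-number count of responses (e.g., 73.91% is 17 of 23). The short Python sketch below (ours, not from the paper) back-calculates those implied counts; the function name and the dictionary of labels are illustrative assumptions, not the study's actual data.

```python
# Minimal sketch: mapping the abstract's reported percentages back to
# response counts out of the 23 MTX questions. Counts here are inferred
# from the published percentages, not taken from the study's dataset.

N_QUESTIONS = 23  # number of MTX questions taken from the earlier study


def as_percent(count: int, total: int = N_QUESTIONS) -> float:
    """Express a count of responses as a percentage of all questions."""
    return round(100 * count / total, 2)


# Counts implied by the reported percentages (e.g., 17/23 = 73.91%).
implied_counts = {
    "BARD/Bing correct answers (73.91%)": 17,
    "GPT-3.5 completely correct (86.96%)": 20,
    "BARD/Bing completely correct (60.87%)": 14,
    "BARD incorrect responses (17.39%)": 4,
    "BARD non-responses (8.7%)": 2,
    "Bing incorrect or non-responses (13.04% each)": 3,
}

for label, count in implied_counts.items():
    print(f"{label}: {count}/{N_QUESTIONS} = {as_percent(count)}%")
```

Running the sketch reproduces each figure quoted in the abstract, which is a quick consistency check on the reported results.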
