Perrot Ophélie, Schirmann Aurelie, Vidart Adrien, Guillot-Tantay Cyrille, Izard Vincent, Lebret Thierry, Boillot Bernard, Mesnard Benoit, Lebacle Cedric, Madec François-Xavier
Foch Hospital, Urology department, Suresnes, France.
Fr J Urol. 2024 Jun;34(5):102636. doi: 10.1016/j.fjurol.2024.102636. Epub 2024 Apr 8.
AI-driven language models are booming, yet their place in medicine remains undefined. The aim of our study was to compare responses to andrology clinical cases between chatbots and andrologists, in order to assess the reliability of these technologies.
We analyzed the responses of 32 experts, 18 residents and three chatbots (ChatGPT v3.5, v4 and Bard) to 25 andrology clinical cases. Responses were assessed on a Likert scale ranging from 0 to 2 for each question (0: false response or no response; 1: partially correct response; 2: correct response), on the basis of the latest national recommendations or, in their absence, international recommendations. We compared the mean scores obtained across all cases by the different groups.
Experts obtained a higher mean score (m=11.0/12.4, σ=1.4) than ChatGPT v4 (m=10.7/12.4, σ=2.2, p=0.6475), ChatGPT v3.5 (m=9.5/12.4, σ=2.1, p=0.0062) and Bard (m=7.2/12.4, σ=3.3, p<0.0001). Residents obtained a mean score (m=9.4/12.4, σ=1.7) higher than Bard (m=7.2/12.4, σ=3.3, p=0.0053) but lower than ChatGPT v3.5 (m=9.5/12.4, σ=2.1, p=0.8393), ChatGPT v4 (m=10.7/12.4, σ=2.2, p=0.0183) and experts (m=11.0/12.4, σ=1.4, p=0.0009). ChatGPT v4 (m=10.7, σ=2.2) performed better than ChatGPT v3.5 (m=9.5, σ=2.1, p=0.0476) and Bard (m=7.2, σ=3.3, p<0.0001).
The use of chatbots in medicine could be relevant. Further studies are needed before they can be integrated into clinical practice.