Perrot Ophélie, Schirmann Aurelie, Vidart Adrien, Guillot-Tantay Cyrille, Izard Vincent, Lebret Thierry, Boillot Bernard, Mesnard Benoit, Lebacle Cedric, Madec François-Xavier
Foch Hospital, Urology department, Suresnes, France.
Fr J Urol. 2024 Jun;34(5):102636. doi: 10.1016/j.fjurol.2024.102636. Epub 2024 Apr 8.
AI-driven language models are booming, yet their place in medicine remains undefined. The aim of our study was to compare responses to andrology clinical cases between chatbots and andrologists, in order to assess the reliability of these technologies.
We analyzed the responses of 32 experts, 18 residents and three chatbots (ChatGPT v3.5, v4 and Bard) to 25 andrology clinical cases. Responses were assessed on a Likert scale ranging from 0 to 2 for each question (0: false response or no response; 1: partially correct response; 2: correct response), on the basis of the latest national recommendations or, in their absence, international recommendations. We compared the mean scores obtained across all cases by the different groups.
Experts obtained a higher mean score (m=11.0/12.4, σ=1.4) than ChatGPT v4 (m=10.7/12.4, σ=2.2, p=0.6475), ChatGPT v3.5 (m=9.5/12.4, σ=2.1, p=0.0062) and Bard (m=7.2/12.4, σ=3.3, p<0.0001). Residents obtained a mean score (m=9.4/12.4, σ=1.7) higher than Bard (m=7.2/12.4, σ=3.3, p=0.0053) but lower than ChatGPT v3.5 (m=9.5/12.4, σ=2.1, p=0.8393), ChatGPT v4 (m=10.7/12.4, σ=2.2, p=0.0183) and experts (m=11.0/12.4, σ=1.4, p=0.0009). ChatGPT v4 (m=10.7, σ=2.2) performed better than ChatGPT v3.5 (m=9.5, σ=2.1, p=0.0476) and Bard (m=7.2, σ=3.3, p<0.0001).
The use of chatbots in medicine could be relevant. Further studies are needed before they can be integrated into clinical practice.