Nechay T V, Sazhin A V, Loban K M, Bogomolova A K, Suglob V V, Beniia T R
Pirogov Russian National Research Medical University, Moscow, Russia.
Khirurgiia (Mosk). 2024(8):6-14. doi: 10.17116/hirurgia20240816.
To evaluate the quality of recommendations provided by ChatGPT regarding inguinal hernia repair.
ChatGPT was asked 5 questions about the surgical management of inguinal hernias. The chatbot was assigned the role of an expert in herniology and instructed to search only specialized medical databases and to provide references and evidence. Herniology experts and surgeons (non-experts) rated the quality of the recommendations generated by ChatGPT on a 4-point scale (0 to 3 points). Statistical correlations between participants' ratings and their attitudes toward artificial intelligence were explored.
Experts scored the quality of ChatGPT responses lower than non-experts (2 (1-2) vs. 2 (2-3), p<0.001). The chatbot failed to provide valid references and actual evidence and fabricated half of the references it cited. Respondents were optimistic about the future of neural networks for clinical decision support, and most opposed restricting their use in healthcare.
We would not recommend non-specialized large language models as a sole or primary source of information for clinical decision-making or as a virtual search assistant.