Roch Friederike Eva, Hahn Franziska Melanie, Jäckle Katharina, Meier Marc-Pascal, Stinus Hartmut, Lehmann Wolfgang, Perthel Ronny, Roch Paul Jonathan
Department of Trauma Surgery, Orthopaedics and Plastic Surgery, University of Göttingen, Robert-Koch-Str. 40, Göttingen 37075, Germany.
Foot Ankle Surg. 2025 Jun;31(4):329-351. doi: 10.1016/j.fas.2024.12.003. Epub 2024 Dec 13.
Free chatbots powered by large language models offer treatment recommendations for lateral ankle sprains (LAS), but these recommendations lack scientific validation.
Three chatbots (Claude, Perplexity, and ChatGPT) were evaluated by comparing their responses to a questionnaire and their treatment algorithms against current clinical guidelines. Responses were graded on accuracy, conclusiveness, supplementary information, and incompleteness, and evaluated individually and collectively, with a 60% pass threshold.
The collective analysis of the questionnaire showed that Perplexity scored significantly higher than Claude and ChatGPT (p < 0.001). In the individual analysis, Perplexity provided significantly more supplementary information than the other chatbots (p < 0.001). All chatbots met the pass threshold. In the algorithm evaluation, ChatGPT scored significantly higher than the others (p = 0.023), while Perplexity fell below the pass threshold.
Chatbots' recommendations generally aligned with current guidelines but sometimes missed crucial details. While they offer useful supplementary information, they cannot yet replace professional medical consultation or established guidelines.