Rossi Meyer Monica K, Kandathil Cherian Kurian, Davis Seth J, Durairaj K Kay, Patel Priyesh N, Pepper Jon-Paul, Spataro Emily A, Most Sam P
Division of Facial Plastic and Reconstructive Surgery, Department of Otolaryngology-Head and Neck Surgery, Stanford University School of Medicine, Stanford, CA, USA.
Department of Otolaryngology, Head and Neck Surgery, Huntington Hospital, Pasadena, California, USA.
Aesthetic Plast Surg. 2025 Apr;49(7):1868-1873. doi: 10.1007/s00266-024-04343-0. Epub 2024 Sep 16.
Assessment of the readability, accuracy, quality, and completeness of ChatGPT (OpenAI, San Francisco, CA), Gemini (Google, Mountain View, CA), and Claude (Anthropic, San Francisco, CA) responses to common questions about rhinoplasty.
Ten questions commonly encountered in the senior author's (SPM) rhinoplasty practice were presented to ChatGPT-4, Gemini, and Claude. Seven facial plastic and reconstructive surgeons with experience in rhinoplasty evaluated these responses for accuracy, quality, completeness, relevance, and use of medical jargon on a Likert scale. The responses were also evaluated using several readability indices.
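The abstract does not specify which readability indices were applied or how they were computed. As a minimal sketch of how such scoring might be done, the Python textstat package (an assumption; the authors' actual tooling is not stated) can evaluate a sample chatbot response against several standard readability formulas, where grade-level values of roughly 13 or higher correspond to college-level reading.

```python
# Minimal sketch (not the study's analysis code): scoring a chatbot response
# with common readability formulas via the Python "textstat" package. The
# abstract does not list the exact indices used, so the ones chosen below
# are illustrative assumptions.
import textstat

response = (
    "Rhinoplasty is a surgical procedure that reshapes the nasal framework. "
    "Postoperative recovery typically involves edema and ecchymosis for one "
    "to two weeks, with the final contour emerging over the following year."
)

# Flesch Reading Ease is a 0-100 score (lower = harder to read).
print("Flesch Reading Ease: ", textstat.flesch_reading_ease(response))

# The remaining indices approximate a U.S. school grade level;
# values of roughly 13 and above correspond to college-level reading.
print("Flesch-Kincaid Grade:", textstat.flesch_kincaid_grade(response))
print("Gunning Fog Index:   ", textstat.gunning_fog(response))
print("SMOG Index:          ", textstat.smog_index(response))
print("Coleman-Liau Index:  ", textstat.coleman_liau_index(response))
```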
ChatGPT achieved significantly higher evaluator scores for accuracy and overall quality but scored significantly lower on completeness than Gemini and Claude. Responses from all three chatbots to the ten questions were rated as neutral to incomplete. All three chatbots used medical jargon, and their responses scored at a college reading level on the readability indices.
Rhinoplasty surgeons should be aware that the medical information found on chatbot platforms is incomplete and still needs to be scrutinized for accuracy. However, the technology has potential for use in healthcare education if it is trained on evidence-based recommendations and its readability is improved.
Level of Evidence V: This journal requires that authors assign a level of evidence to each article. For a full description of these Evidence-Based Medicine ratings, please refer to the Table of Contents or the online Instructions to Authors at www.springer.com/00266.