Su Jieyan, Yang Xi, Li Xiangying, Chen Jiaxuan, Jiang Caixin, Wang Yi, Zhuang Le, Li Hang
Department of Dermatology, Central Hospital Affiliated to Shandong First Medical University, Jinan, People's Republic of China.
School of Clinical Medicine, Shandong Second Medical University, Weifang, People's Republic of China.
Clin Cosmet Investig Dermatol. 2025 Oct 23;18:2757-2767. doi: 10.2147/CCID.S552979. eCollection 2025.
Vitiligo causes significant psychological stress, creating a strong demand for accessible educational resources beyond clinical settings. This demand remains largely unmet. Large language models (LLMs) have the potential to bridge this gap by enhancing patient education. However, it remains uncertain whether LLMs can accurately address individualized patient inquiries and whether this capability varies between models.
This study aims to evaluate the applicability, accuracy, and potential limitations of OpenAI o1, DeepSeek-R1, and Grok 3 for vitiligo patient education.
Three dermatology experts first developed sixteen vitiligo-related questions based on common patient concerns, categorized as descriptive or recommendatory, each at basic and advanced levels. Three vitiligo-specialized dermatologists then rated the responses of the three LLMs for accuracy, comprehensibility, and relevance on a Likert scale. Additionally, three patients rated the comprehensibility of the responses, and a readability analysis was performed.
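As a rough illustration of this rating pipeline (the study does not publish its scoring code; the record layout, the 5-point scale, and the aggregation below are assumptions made purely for illustration), mean Likert scores per model and dimension could be tallied as follows:

```python
# Hypothetical sketch of aggregating expert Likert ratings per model and
# dimension. A 5-point scale and this record layout are assumptions; the
# abstract does not specify either.
from collections import defaultdict
from statistics import mean

# (model, question_id, rater, dimension, score) -- illustrative values only
ratings = [
    ("OpenAI o1",   "Q1", "rater1", "accuracy", 5),
    ("OpenAI o1",   "Q1", "rater2", "accuracy", 4),
    ("DeepSeek-R1", "Q1", "rater1", "accuracy", 5),
    ("Grok 3",      "Q1", "rater1", "accuracy", 4),
]

by_model_dim = defaultdict(list)
for model, _qid, _rater, dim, score in ratings:
    by_model_dim[(model, dim)].append(score)

for (model, dim), scores in sorted(by_model_dim.items()):
    print(f"{model:12s} {dim:16s} mean = {mean(scores):.2f} (n={len(scores)})")
```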
All three LLMs demonstrated satisfactory accuracy, comprehensibility, and completeness, although their performance varied. They answered basic descriptive questions with 100% accuracy but were inconsistent on complex recommendatory queries, particularly treatment recommendations for specific populations. In pairwise comparisons, DeepSeek-R1 outperformed OpenAI o1 in accuracy scores (p = 0.042), with no significant difference versus Grok 3 (p = 0.157). Readability assessments revealed high reading difficulty across all models; DeepSeek-R1 was the least readable (mean Flesch Reading Ease score 19.7, significantly lower than both OpenAI o1 and Grok 3, p < 0.01), potentially reducing accessibility for diverse patient populations.
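For context, the Flesch Reading Ease score is defined as FRE = 206.835 − 1.015 × (words/sentences) − 84.6 × (syllables/words), so a mean of 19.7 falls in the 0–30 "very difficult" band, roughly college-graduate reading level. Below is a minimal sketch of the readability comparison; the abstract does not name the statistical test, so a Mann-Whitney U test is assumed purely for illustration, and the syllable counter is a crude heuristic rather than the lexicon-based counting used in formal analyses:

```python
# Minimal sketch: Flesch Reading Ease plus a pairwise comparison of
# per-response scores. The heuristic syllable counter and the choice of
# Mann-Whitney U are assumptions; the abstract specifies neither.
import re
from scipy.stats import mannwhitneyu

def count_syllables(word: str) -> int:
    # Crude vowel-group heuristic; real analyses use a pronunciation lexicon.
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))

def flesch_reading_ease(text: str) -> float:
    # Standard FRE formula: higher scores mean easier text.
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    n_words = max(1, len(words))
    syllables = sum(count_syllables(w) for w in words)
    return 206.835 - 1.015 * (n_words / sentences) - 84.6 * (syllables / n_words)

# Per-response FRE scores for two models (illustrative values only).
deepseek_scores = [18.2, 21.4, 17.9, 22.0, 19.1]
openai_scores = [31.5, 28.7, 35.2, 30.1, 33.8]

stat, p = mannwhitneyu(deepseek_scores, openai_scores, alternative="two-sided")
print(f"Mann-Whitney U = {stat:.1f}, p = {p:.4f}")
```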
Reasoning LLMs answer simple vitiligo-related questions with high accuracy, but the quality of their treatment recommendations declines as question complexity increases. Because current models still make errors when advising on vitiligo treatment, developers need stronger filtering mechanisms, and human oversight of medical decision-making remains mandatory.