Shoval Dorit Hadar, Gigi Karny, Haber Yuval, Itzhaki Amir, Asraf Kfir, Piterman David, Elyoseph Zohar
The Center for Psychobiological Research, Department of Psychology and Educational Counseling, Max Stern Yezreel Valley College, Yezreel Valley, Israel.
The Institute for Research and Development, The Artificial Third, Tel Aviv, Israel.
BMC Psychiatry. 2025 May 12;25(1):478. doi: 10.1186/s12888-025-06912-2.
BACKGROUND: Despite significant advances in AI-driven medical diagnostics, the integration of large language models (LLMs) into psychiatric practice presents unique challenges. While LLMs demonstrate high accuracy in controlled settings, their performance in collaborative clinical environments remains unclear. This study examined whether LLMs exhibit conformity behavior under social pressure across different diagnostic certainty levels, with a particular focus on psychiatric assessment.
METHODS: Using an adapted Asch paradigm, we conducted a controlled trial examining GPT-4o's performance across three domains representing increasing levels of diagnostic uncertainty: circle similarity judgments (high certainty), brain tumor identification (intermediate certainty), and psychiatric assessment using children's drawings (high uncertainty). The study employed a 3 × 3 factorial design with three pressure conditions: no pressure, full pressure (five consecutive incorrect peer responses), and partial pressure (mixed correct and incorrect peer responses). We conducted 10 trials per condition combination (90 total observations), using standardized prompts and multiple-choice responses. Binomial tests and chi-square analyses assessed performance differences across conditions.
RESULTS: Under no pressure, GPT-4o achieved 100% accuracy across all domains. Under full pressure, accuracy declined systematically with increasing diagnostic uncertainty: 50% in circle recognition, 40% in tumor identification, and 0% in psychiatric assessment. Partial pressure showed a similar pattern, with maintained accuracy in basic tasks (80% in circle recognition, 100% in tumor identification) but complete failure in psychiatric assessment (0%). All differences between no pressure and pressure conditions were statistically significant (P <.05), with the most severe effects observed in psychiatric assessment (χ²₁=16.20, P <.001).
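The reported test statistic for the psychiatric assessment contrast can be reproduced from the counts given in the abstract (10/10 correct under no pressure vs. 0/10 under full pressure). A minimal sketch, assuming a 2 × 2 contingency table analyzed with Yates' continuity correction (an assumption not stated in the abstract, but the correction is what yields 16.20 rather than the uncorrected value of 20.0):

```python
def yates_chi_square(table):
    """Chi-square statistic for a 2x2 table with Yates' continuity correction."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = [a + b, c + d]
    col_totals = [a + c, b + d]
    chi2 = 0.0
    for i, observed in enumerate([a, b, c, d]):
        expected = row_totals[i // 2] * col_totals[i % 2] / n
        chi2 += (abs(observed - expected) - 0.5) ** 2 / expected
    return chi2

# Rows: no pressure vs. full pressure; columns: correct vs. incorrect responses.
table = [[10, 0], [0, 10]]
print(round(yates_chi_square(table), 2))  # 16.2
```

With 1 degree of freedom, a chi-square of 16.2 corresponds to P < .001, matching the reported result.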
CONCLUSIONS: This study reveals that LLMs exhibit conformity patterns that intensify with diagnostic uncertainty, culminating in complete performance failure in psychiatric assessment under social pressure. These findings suggest that successful implementation of AI in psychiatry requires careful consideration of social dynamics and the inherent uncertainty in psychiatric diagnosis. Future research should validate these findings across different AI systems and diagnostic tools while developing strategies to maintain AI independence in clinical settings. TRIAL REGISTRATION: Not applicable.