A controlled trial examining large Language model conformity in psychiatric assessment using the Asch paradigm.

Authors

Shoval Dorit Hadar, Gigi Karny, Haber Yuval, Itzhaki Amir, Asraf Kfir, Piterman David, Elyoseph Zohar

Affiliations

The Center for Psychobiological Research, Department of Psychology and Educational Counseling, Max Stern Yezreel Valley College, Yezreel Valley, Israel.

The Institute for Research and Development, The Artificial Third, Tel Aviv, Israel.

Publication

BMC Psychiatry. 2025 May 12;25(1):478. doi: 10.1186/s12888-025-06912-2.


DOI:10.1186/s12888-025-06912-2
PMID:40355854
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12070653/
Abstract

BACKGROUND: Despite significant advances in AI-driven medical diagnostics, the integration of large language models (LLMs) into psychiatric practice presents unique challenges. While LLMs demonstrate high accuracy in controlled settings, their performance in collaborative clinical environments remains unclear. This study examined whether LLMs exhibit conformity behavior under social pressure across different diagnostic certainty levels, with a particular focus on psychiatric assessment.

METHODS: Using an adapted Asch paradigm, we conducted a controlled trial examining GPT-4o's performance across three domains representing increasing levels of diagnostic uncertainty: circle similarity judgments (high certainty), brain tumor identification (intermediate certainty), and psychiatric assessment using children's drawings (high uncertainty). The study employed a 3 × 3 factorial design with three pressure conditions: no pressure, full pressure (five consecutive incorrect peer responses), and partial pressure (mixed correct and incorrect peer responses). We conducted 10 trials per condition combination (90 total observations), using standardized prompts and multiple-choice responses. The binomial test and chi-square analyses assessed performance differences across conditions.

RESULTS: Under no pressure, GPT-4o achieved 100% accuracy across all domains. Under full pressure, accuracy declined systematically with increasing diagnostic uncertainty: 50% in circle recognition, 40% in tumor identification, and 0% in psychiatric assessment. Partial pressure showed a similar pattern, with maintained accuracy in basic tasks (80% in circle recognition, 100% in tumor identification) but complete failure in psychiatric assessment (0%). All differences between no pressure and pressure conditions were statistically significant (P < .05), with the most severe effects observed in psychiatric assessment (χ²₁ = 16.20, P < .001).

CONCLUSIONS: This study reveals that LLMs exhibit conformity patterns that intensify with diagnostic uncertainty, culminating in complete performance failure in psychiatric assessment under social pressure. These findings suggest that successful implementation of AI in psychiatry requires careful consideration of social dynamics and the inherent uncertainty in psychiatric diagnosis. Future research should validate these findings across different AI systems and diagnostic tools while developing strategies to maintain AI independence in clinical settings.

TRIAL REGISTRATION: Not applicable.
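
As a rough check on the reported statistic, the sketch below recomputes a chi-square value for the psychiatric assessment domain from the counts implied by the abstract (10 of 10 correct with no pressure, 0 of 10 correct under pressure). The abstract does not say which chi-square variant was used. Applying the Yates continuity correction is an assumption made here, chosen because it happens to reproduce the reported value of 16.20. The function names are illustrative and are not taken from the paper.

```python
import math


def yates_chi2_2x2(a, b, c, d):
    """Yates-corrected chi-square statistic (df = 1) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        raise ValueError("a row or column total is zero; the statistic is undefined")
    # Continuity correction: shrink |ad - bc| by n/2, never below zero.
    corrected = max(abs(a * d - b * c) - n / 2, 0.0)
    return n * corrected ** 2 / denom


def chi2_p_value_df1(x):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2))


TRIALS = 10                # trials per condition combination, from the abstract
correct_no_pressure = 10   # 100% accuracy with no pressure
correct_pressure = 0       # 0% accuracy under pressure (psychiatric assessment)

chi2 = yates_chi2_2x2(
    correct_no_pressure, TRIALS - correct_no_pressure,
    correct_pressure, TRIALS - correct_pressure,
)
p = chi2_p_value_df1(chi2)
print(f"chi2(1) = {chi2:.2f}, p = {p:.2g}")
```

Under these assumptions the statistic comes out at 16.20 with a p-value below .001, which matches the figures quoted in the abstract. Without the continuity correction the same table gives 20.0, so the correction is the likelier choice, though the abstract alone does not confirm it.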


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/a737/12070653/2fb95ae144a7/12888_2025_6912_Fig2_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/a737/12070653/d800236ffef6/12888_2025_6912_Fig1_HTML.jpg
