
A controlled trial examining large language model conformity in psychiatric assessment using the Asch paradigm.

Authors

Shoval Dorit Hadar, Gigi Karny, Haber Yuval, Itzhaki Amir, Asraf Kfir, Piterman David, Elyoseph Zohar

Affiliations

The Center for Psychobiological Research, Department of Psychology and Educational Counseling, Max Stern Yezreel Valley College, Yezreel Valley, Israel.

The Institute for Research and Development, The Artificial Third, Tel Aviv, Israel.

Publication

BMC Psychiatry. 2025 May 12;25(1):478. doi: 10.1186/s12888-025-06912-2.

DOI: 10.1186/s12888-025-06912-2
PMID: 40355854
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12070653/

Abstract

BACKGROUND

Despite significant advances in AI-driven medical diagnostics, the integration of large language models (LLMs) into psychiatric practice presents unique challenges. While LLMs demonstrate high accuracy in controlled settings, their performance in collaborative clinical environments remains unclear. This study examined whether LLMs exhibit conformity behavior under social pressure across different diagnostic certainty levels, with a particular focus on psychiatric assessment.

METHODS

Using an adapted Asch paradigm, we conducted a controlled trial examining GPT-4o's performance across three domains representing increasing levels of diagnostic uncertainty: circle similarity judgments (high certainty), brain tumor identification (intermediate certainty), and psychiatric assessment using children's drawings (high uncertainty). The study employed a 3 × 3 factorial design with three pressure conditions: no pressure, full pressure (five consecutive incorrect peer responses), and partial pressure (mixed correct and incorrect peer responses). We conducted 10 trials per condition combination (90 total observations), using standardized prompts and multiple-choice responses. The binomial test and chi-square analyses assessed performance differences across conditions.
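
The 3 × 3 design described in the Methods can be enumerated in a few lines as a sanity check on the stated observation count. This is only an illustrative sketch: the label strings are paraphrased from the abstract, and `build_design` is a hypothetical helper, not something taken from the study's materials.

```python
from itertools import product

# Labels are paraphrased from the abstract; the exact wording is an assumption.
DOMAINS = ["circle similarity", "brain tumor identification", "psychiatric assessment"]
PRESSURES = ["no pressure", "full pressure", "partial pressure"]
TRIALS_PER_CELL = 10


def build_design(domains=DOMAINS, pressures=PRESSURES, trials=TRIALS_PER_CELL):
    """Return one (domain, pressure, trial_index) tuple per planned observation."""
    return [
        (domain, pressure, trial)
        for domain, pressure in product(domains, pressures)
        for trial in range(1, trials + 1)
    ]


design = build_design()
print(len(design))  # 3 domains x 3 pressure conditions x 10 trials = 90
```

Each of the nine domain-by-pressure cells contributes 10 trials, which gives the 90 observations reported.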

RESULTS

Under no pressure, GPT-4o achieved 100% accuracy across all domains. Under full pressure, accuracy declined systematically with increasing diagnostic uncertainty: 50% in circle recognition, 40% in tumor identification, and 0% in psychiatric assessment. Partial pressure showed a similar pattern, with maintained accuracy in basic tasks (80% in circle recognition, 100% in tumor identification) but complete failure in psychiatric assessment (0%). All differences between no pressure and pressure conditions were statistically significant (P < .05), with the most severe effects observed in psychiatric assessment (χ²₁ = 16.20, P < .001).
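
The reported psychiatric-assessment statistic can be checked against the stated accuracies. The abstract does not say which comparison or correction produced χ² = 16.20, so the sketch below rests on two assumptions: that the comparison is a 2 × 2 table of 10/10 correct under no pressure against 0/10 correct under pressure, and that a Yates continuity correction was applied. Under those assumptions the value comes out the same. The function names are hypothetical.

```python
import math


def yates_chi_square_2x2(a, b, c, d):
    """Yates-corrected chi-square statistic for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        raise ValueError("a row or column total is zero")
    # Continuity correction: shrink |ad - bc| by n/2, never below zero.
    corrected = max(abs(a * d - b * c) - n / 2, 0.0)
    return n * corrected ** 2 / denominator


def chi_square_p_value_df1(statistic):
    """Upper-tail p-value of a chi-square distribution with 1 degree of freedom."""
    return math.erfc(math.sqrt(statistic / 2))


# Assumed table. Rows: no pressure, pressure. Columns: correct, incorrect.
stat = yates_chi_square_2x2(10, 0, 0, 10)
p = chi_square_p_value_df1(stat)
print(round(stat, 2), p < 0.001)  # 16.2 True
```

Without the continuity correction the same table gives χ² = 20, so the match with 16.20 suggests, but does not confirm, that a corrected test was used.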

CONCLUSIONS

This study reveals that LLMs exhibit conformity patterns that intensify with diagnostic uncertainty, culminating in complete performance failure in psychiatric assessment under social pressure. These findings suggest that successful implementation of AI in psychiatry requires careful consideration of social dynamics and the inherent uncertainty in psychiatric diagnosis. Future research should validate these findings across different AI systems and diagnostic tools while developing strategies to maintain AI independence in clinical settings.

TRIAL REGISTRATION

Not applicable.


Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/a737/12070653/d800236ffef6/12888_2025_6912_Fig1_HTML.jpg
Figure 2: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/a737/12070653/2fb95ae144a7/12888_2025_6912_Fig2_HTML.jpg
