

Assessing the Accuracy and Reliability of Large Language Models in Psychiatry Using Standardized Multiple-Choice Questions: Cross-Sectional Study.

Authors

Hanss Kaitlin, Sarma Karthik V, Glowinski Anne L, Krystal Andrew, Saunders Ramotse, Halls Andrew, Gorrell Sasha, Reilly Erin

Affiliations

Department of Psychiatry and Behavioral Sciences, University of California, San Francisco, San Francisco, CA, United States.

Publication

J Med Internet Res. 2025 May 20;27:e69910. doi: 10.2196/69910.

Abstract

BACKGROUND

Large language models (LLMs), such as OpenAI's GPT-3.5, GPT-4, and GPT-4o, have garnered early and significant enthusiasm for their potential applications within mental health, ranging from documentation support to chatbot therapy. Understanding the accuracy and reliability of the psychiatric "knowledge" stored within the parameters of these models and developing measures of confidence in their responses (ie, the likelihood that an LLM response is accurate) are crucial for the safe and effective integration of these tools into mental health settings.

OBJECTIVE

This study aimed to assess the accuracy, reliability, and predictors of accuracy of GPT-3.5 (175 billion parameters), GPT-4 (approximately 1.8 trillion parameters), and GPT-4o (an optimized version of GPT-4 with unknown parameters) with standardized psychiatry multiple-choice questions (MCQs).

METHODS

A cross-sectional study was conducted where 3 commonly available, commercial LLMs (GPT-3.5, GPT-4, and GPT-4o) were tested for their ability to provide answers to single-answer MCQs (N=150) extracted from the Psychiatry Test Preparation and Review Manual. Each model generated answers to every MCQ 10 times. We evaluated the accuracy and reliability of the answers and sought predictors of answer accuracy. Our primary outcome was the proportion of questions answered correctly by each LLM (accuracy). Secondary measures were (1) response consistency to MCQs across 10 trials (reliability), (2) the correlation between MCQ answer accuracy and response consistency, and (3) the correlation between MCQ answer accuracy and model self-reported confidence.
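The abstract does not include the authors' code, so the following is a minimal sketch of the repeated-prompting protocol described above, assuming the OpenAI Python client; the prompt wording, scoring rule, and consistency definition (share of trials giving the modal answer) are illustrative assumptions rather than the authors' exact setup.

```python
# Illustrative sketch only: repeated prompting of one MCQ and simple
# accuracy/consistency scoring. Assumes the OpenAI Python client (openai>=1.0);
# prompt wording and scoring are assumptions, not the study's exact protocol.
from collections import Counter
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask_once(model: str, question: str, options: dict[str, str]) -> str:
    """Ask one MCQ and return the single option letter the model gives."""
    prompt = question + "\n" + "\n".join(f"{k}. {v}" for k, v in options.items())
    resp = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "Answer with a single option letter."},
            {"role": "user", "content": prompt},
        ],
    )
    return resp.choices[0].message.content.strip()[0].upper()

def evaluate(model: str, mcq: dict, n_trials: int = 10) -> dict:
    """Repeat one MCQ n_trials times; report first-attempt accuracy and consistency."""
    answers = [ask_once(model, mcq["question"], mcq["options"]) for _ in range(n_trials)]
    modal_answer, modal_count = Counter(answers).most_common(1)[0]
    return {
        "first_correct": answers[0] == mcq["correct"],
        "consistency": modal_count / n_trials,  # fraction of trials matching the modal answer
        "modal_answer": modal_answer,
    }
```

Running `evaluate(model, mcq)` for each of the 150 MCQs and each model would yield the per-question accuracy and consistency values that the outcome measures describe.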

RESULTS

On the first attempt, GPT-3.5 answered 58.0% (87/150) of MCQs correctly, while GPT-4 and GPT-4o answered 84.0% (126/150) and 87.3% (131/150) correctly, respectively. GPT-4 and GPT-4o showed no difference in performance (P=.51), but they significantly outperformed GPT-3.5 (P<.001). GPT-3.5 exhibited less response consistency on average compared to the other models (P<.001). MCQ response consistency was positively correlated with MCQ accuracy across all models (r=0.340, 0.682, and 0.590 for GPT-3.5, GPT-4, and GPT-4o, respectively; all P<.001), whereas model self-reported confidence showed no correlation with accuracy in the models, except for GPT-3.5, where self-reported confidence was weakly inversely correlated with accuracy (P<.001).
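The abstract reports correlation coefficients between per-question consistency and accuracy but does not state the exact statistic used; a sketch of one way such a correlation could be computed (point-biserial correlation via SciPy is an assumption here) is shown below, using the per-question results from the sketch above.

```python
# Sketch: correlate per-question response consistency with first-attempt
# correctness across the MCQs. The point-biserial statistic is an assumption;
# the abstract reports r values but not the exact method used.
from scipy.stats import pointbiserialr

def consistency_accuracy_correlation(results: list[dict]) -> tuple[float, float]:
    correct = [1 if r["first_correct"] else 0 for r in results]  # binary accuracy per MCQ
    consistency = [r["consistency"] for r in results]            # fraction in (0, 1]
    r, p = pointbiserialr(correct, consistency)
    return r, p
```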

CONCLUSIONS

To our knowledge, this is the first comprehensive evaluation of the general psychiatric knowledge encoded in commercially available LLMs and the first study to assess their reliability and identify predictors of response accuracy within medical domains. The findings suggest that GPT-4 and GPT-4o encode accurate and reliable general psychiatric knowledge and that methods, such as repeated prompting, may provide a measure of LLM response confidence. This work supports the potential of LLMs in mental health settings and motivates further research to assess their performance in more open-ended clinical contexts.


Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f8da/12134693/f7e92ceafabf/jmir_v27i1e69910_fig1.jpg
