

Assessing the Accuracy and Reliability of Large Language Models in Psychiatry Using Standardized Multiple-Choice Questions: Cross-Sectional Study.

Authors

Hanss Kaitlin, Sarma Karthik V, Glowinski Anne L, Krystal Andrew, Saunders Ramotse, Halls Andrew, Gorrell Sasha, Reilly Erin

Affiliations

Department of Psychiatry and Behavioral Sciences, University of California, San Francisco, San Francisco, CA, United States.

Publication

J Med Internet Res. 2025 May 20;27:e69910. doi: 10.2196/69910.

Abstract

BACKGROUND

Large language models (LLMs), such as OpenAI's GPT-3.5, GPT-4, and GPT-4o, have garnered early and significant enthusiasm for their potential applications within mental health, ranging from documentation support to chatbot therapy. Understanding the accuracy and reliability of the psychiatric "knowledge" stored within the parameters of these models and developing measures of confidence in their responses (ie, the likelihood that an LLM response is accurate) are crucial for the safe and effective integration of these tools into mental health settings.

OBJECTIVE

This study aimed to assess the accuracy, reliability, and predictors of accuracy of GPT-3.5 (175 billion parameters), GPT-4 (approximately 1.8 trillion parameters), and GPT-4o (an optimized version of GPT-4 with unknown parameters) with standardized psychiatry multiple-choice questions (MCQs).

METHODS

A cross-sectional study was conducted where 3 commonly available, commercial LLMs (GPT-3.5, GPT-4, and GPT-4o) were tested for their ability to provide answers to single-answer MCQs (N=150) extracted from the Psychiatry Test Preparation and Review Manual. Each model generated answers to every MCQ 10 times. We evaluated the accuracy and reliability of the answers and sought predictors of answer accuracy. Our primary outcome was the proportion of questions answered correctly by each LLM (accuracy). Secondary measures were (1) response consistency to MCQs across 10 trials (reliability), (2) the correlation between MCQ answer accuracy and response consistency, and (3) the correlation between MCQ answer accuracy and model self-reported confidence.
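The abstract does not include the authors' code, so the following is a minimal sketch of the repeated-prompting protocol described above, assuming the OpenAI Python client; the prompt wording, scoring rule, and consistency definition (share of trials giving the modal answer) are illustrative assumptions rather than the authors' exact setup.

```python
# Illustrative sketch only: repeated prompting of one MCQ and simple
# accuracy/consistency scoring. Assumes the OpenAI Python client (openai>=1.0);
# prompt wording and scoring are assumptions, not the study's exact protocol.
from collections import Counter
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask_once(model: str, question: str, options: dict[str, str]) -> str:
    """Ask one MCQ and return the single option letter the model gives."""
    prompt = question + "\n" + "\n".join(f"{k}. {v}" for k, v in options.items())
    resp = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "Answer with a single option letter."},
            {"role": "user", "content": prompt},
        ],
    )
    return resp.choices[0].message.content.strip()[0].upper()

def evaluate(model: str, mcq: dict, n_trials: int = 10) -> dict:
    """Repeat one MCQ n_trials times; report first-attempt accuracy and consistency."""
    answers = [ask_once(model, mcq["question"], mcq["options"]) for _ in range(n_trials)]
    modal_answer, modal_count = Counter(answers).most_common(1)[0]
    return {
        "first_correct": answers[0] == mcq["correct"],
        "consistency": modal_count / n_trials,  # fraction of trials matching the modal answer
        "modal_answer": modal_answer,
    }
```

Running `evaluate(model, mcq)` for each of the 150 MCQs and each model would yield the per-question accuracy and consistency values that the outcome measures describe.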

RESULTS

On the first attempt, GPT-3.5 answered 58.0% (87/150) of MCQs correctly, while GPT-4 and GPT-4o answered 84.0% (126/150) and 87.3% (131/150) correctly, respectively. GPT-4 and GPT-4o showed no difference in performance (P=.51), but they significantly outperformed GPT-3.5 (P<.001). GPT-3.5 exhibited less response consistency on average compared to the other models (P<.001). MCQ response consistency was positively correlated with MCQ accuracy across all models (r=0.340, 0.682, and 0.590 for GPT-3.5, GPT-4, and GPT-4o, respectively; all P<.001), whereas model self-reported confidence showed no correlation with accuracy in the models, except for GPT-3.5, where self-reported confidence was weakly inversely correlated with accuracy (P<.001).
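The abstract reports correlation coefficients between per-question consistency and accuracy but does not state the exact statistic used; a sketch of one way such a correlation could be computed (point-biserial correlation via SciPy is an assumption here) is shown below, using the per-question results from the sketch above.

```python
# Sketch: correlate per-question response consistency with first-attempt
# correctness across the MCQs. The point-biserial statistic is an assumption;
# the abstract reports r values but not the exact method used.
from scipy.stats import pointbiserialr

def consistency_accuracy_correlation(results: list[dict]) -> tuple[float, float]:
    correct = [1 if r["first_correct"] else 0 for r in results]  # binary accuracy per MCQ
    consistency = [r["consistency"] for r in results]            # fraction in (0, 1]
    r, p = pointbiserialr(correct, consistency)
    return r, p
```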

CONCLUSIONS

To our knowledge, this is the first comprehensive evaluation of the general psychiatric knowledge encoded in commercially available LLMs and the first study to assess their reliability and identify predictors of response accuracy within medical domains. The findings suggest that GPT-4 and GPT-4o encode accurate and reliable general psychiatric knowledge and that methods, such as repeated prompting, may provide a measure of LLM response confidence. This work supports the potential of LLMs in mental health settings and motivates further research to assess their performance in more open-ended clinical contexts.


Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f8da/12134693/f7e92ceafabf/jmir_v27i1e69910_fig1.jpg
