Bentegeac Raphaël, Le Guellec Bastien, Kuchcinski Grégory, Amouyel Philippe, Hamroun Aghiles
Lille University, Lille University Hospital Center, Department of Neuroradiology, Rue Émile Laine, Lille, FR.
Univ. Lille, Inserm, Centre Hosp. Univ Lille, Institut Pasteur de Lille, UMR1167 - Labex DISTALZ - RID-AGE - Risk factors and molecular determinants of aging-related diseases, Lille, FR.
J Med Internet Res. 2025 Jul 1. doi: 10.2196/64348.
Chatbots have demonstrated promising capabilities in medicine, achieving passing scores on board examinations across various specialties. However, their tendency to express high levels of confidence in their responses, even when incorrect, limits their utility in clinical settings.
To examine whether token probabilities outperform chatbots' expressed confidence levels in predicting the accuracy of their responses to medical questions.
Nine large language models (LLMs), comprising both commercial (GPT-3.5, GPT-4, and GPT-4o) and open-source (Llama 3.1-8B, Llama 3.1-70B, Phi-3-Mini, Phi-3-Medium, Gemma 2-9B, and Gemma 2-27B) models, were prompted to respond to a set of 2,522 questions from the US Medical Licensing Examination (MedQA database). Additionally, the models rated their confidence on a scale of 0 to 100, and the token probability of each response was extracted. The models' success rates were measured, and the ability of both expressed confidence and response token probability to predict response accuracy was evaluated using the area under the receiver operating characteristic curve (AUROC), adapted calibration error (ACE), and the Brier score. Sensitivity analyses were conducted using additional questions sourced from other databases in English (MedMCQA, n=2,797), Chinese (MedQA Mainland China, n=3,413, and Taiwan, n=2,808), and French (FrMedMCQA, n=1,079), as well as different prompting strategies and temperature settings.
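As a minimal sketch of this procedure, the snippet below shows how a response's token probability and a self-rated confidence might be collected for one multiple-choice question, assuming the OpenAI Python SDK; the exact prompts, models, and parsing used by the authors are not specified in the abstract, so the prompt wording and the single-letter-answer assumption here are illustrative only.

```python
import math
from openai import OpenAI  # assumes the OpenAI Python SDK; other backends expose logprobs similarly

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def answer_with_probability(question: str, options: str) -> tuple[str, float, float]:
    """Return (answer, token probability, expressed confidence) for one multiple-choice question."""
    response = client.chat.completions.create(
        model="gpt-4o",
        temperature=0,
        logprobs=True,   # request per-token log probabilities
        max_tokens=1,    # assumes the answer is a single option-letter token
        messages=[
            {"role": "system",
             "content": "Answer the multiple-choice question with the letter of the correct option only."},
            {"role": "user", "content": f"{question}\n{options}"},
        ],
    )
    first_token = response.choices[0].logprobs.content[0]
    answer = first_token.token.strip()
    token_probability = math.exp(first_token.logprob)  # convert log probability to probability

    # A second call asks the model to rate its own confidence (0-100), mirroring the study design.
    rating = client.chat.completions.create(
        model="gpt-4o",
        temperature=0,
        messages=[
            {"role": "user",
             "content": f"{question}\n{options}\nYou answered {answer}. "
                        "Rate your confidence in this answer from 0 to 100. Reply with a number only."},
        ],
    )
    expressed_confidence = float(rating.choices[0].message.content.strip())
    return answer, token_probability, expressed_confidence
```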
Overall, mean accuracy ranged from 56.5% [54.6-58.5] for Phi-3-Mini to 89.0% [87.7-90.2] for GPT-4o. Across the US Medical Licensing Examination questions, all chatbots consistently expressed high levels of confidence in their responses (ranging from 90 [90-90] for Llama 3.1-70B to 100 [100-100] for GPT-3.5). However, expressed confidence failed to predict response accuracy (AUROCs ranging from 0.52 [0.50-0.53] for Phi-3-Mini to 0.68 [0.65-0.71] for GPT-4o). In contrast, response token probability consistently outperformed expressed confidence in predicting response accuracy (AUROCs ranging from 0.71 [0.69-0.73] for Phi-3-Mini to 0.87 [0.85-0.89] for GPT-4o; all p<0.001). Furthermore, all models demonstrated imperfect calibration, with a general trend toward overconfidence. These findings were consistent across sensitivity analyses.
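The three metrics reported above can be sketched in a few lines, assuming scikit-learn and NumPy; note that the study's exact ACE formulation is not given in the abstract, so the equal-mass binning below is a simplified binary-case illustration, and the arrays are toy data.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, brier_score_loss

def adapted_calibration_error(confidences: np.ndarray, correct: np.ndarray, n_bins: int = 10) -> float:
    """Mean |accuracy - confidence| over equal-mass (quantile) bins of the confidence scores."""
    order = np.argsort(confidences)
    bins = np.array_split(order, n_bins)  # equal-mass bins: each holds ~len/n_bins responses
    return float(np.mean([
        abs(correct[idx].mean() - confidences[idx].mean()) for idx in bins if len(idx) > 0
    ]))

# correct: 1 if the model's answer matched the key, 0 otherwise;
# scores: either expressed confidence rescaled to [0, 1] or the response token probability.
correct = np.array([1, 1, 0, 1, 0, 1, 1, 0, 1, 1])
scores = np.array([0.99, 0.95, 0.97, 0.90, 0.85, 0.92, 0.88, 0.93, 0.96, 0.91])

print("AUROC:", roc_auc_score(correct, scores))     # discrimination: are correct answers ranked higher?
print("Brier:", brier_score_loss(correct, scores))  # overall probabilistic accuracy
print("ACE:  ", adapted_calibration_error(scores, correct))  # gap between confidence and accuracy
```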
Given chatbots' limited capacity to accurately evaluate their own confidence when responding to medical queries, clinicians and patients should not rely on their self-rated certainty. Instead, token probabilities emerge as a promising and easily accessible alternative for gauging the internal uncertainty of these models.