Performance Assessment of ChatGPT versus Bard in Detecting Alzheimer's Dementia.

Authors

B T Balamurali, Chen Jer-Ming

Affiliation

Science, Mathematics & Technology (SMT), Singapore University of Technology & Design, 8 Somapah Rd, Singapore 487372, Singapore.

Publication

Diagnostics (Basel). 2024 Apr 15;14(8):817. doi: 10.3390/diagnostics14080817.

DOI:10.3390/diagnostics14080817

PMID:38667463

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11048951/

Abstract

Large language models (LLMs) find increasing applications in many fields. Here, three LLM chatbots (ChatGPT-3.5, ChatGPT-4, and Bard) are assessed in their current form, as publicly available, for their ability to recognize Alzheimer's dementia (AD) and Cognitively Normal (CN) individuals using textual input derived from spontaneous speech recordings. A zero-shot learning approach is used at two levels of independent queries, with the second query (chain-of-thought prompting) eliciting more detailed information than the first. Each LLM chatbot's performance is evaluated on the prediction generated in terms of accuracy, sensitivity, specificity, precision, and F1 score. LLM chatbots generated a three-class outcome ("AD", "CN", or "Unsure"). When positively identifying AD, Bard produced the highest true-positives (89% recall) and highest F1 score (71%), but tended to misidentify CN as AD, with high confidence (low "Unsure" rates); for positively identifying CN, GPT-4 resulted in the highest true-negatives at 56% and highest F1 score (62%), adopting a diplomatic stance (moderate "Unsure" rates). Overall, the three LLM chatbots can identify AD vs. CN, surpassing chance-levels, but do not currently satisfy the requirements for clinical application.
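The metrics named in the abstract can be illustrated with a minimal sketch. It scores one class at a time (AD as the positive class, then CN as the positive class), which mirrors how the abstract reports results. Two things here are assumptions, not taken from the paper: an "Unsure" reply is counted as "not the positive class", and the labels in the example are made-up toy data. The function name `one_vs_rest_metrics` is hypothetical.

```python
from collections import Counter


def one_vs_rest_metrics(truth, predicted, positive):
    """Score one class against everything else.

    Assumption (not from the paper): an "Unsure" reply is treated as
    'not the positive class', so it can only become a FN or a TN.
    """
    counts = Counter()
    for t, p in zip(truth, predicted):
        if p == positive:
            counts["tp" if t == positive else "fp"] += 1
        else:
            counts["fn" if t == positive else "tn"] += 1
    tp, fp, fn, tn = (counts[k] for k in ("tp", "fp", "fn", "tn"))

    def ratio(num, den):
        # Return 0.0 instead of raising when a denominator is empty.
        return num / den if den else 0.0

    precision = ratio(tp, tp + fp)
    sensitivity = ratio(tp, tp + fn)  # also called recall
    return {
        "accuracy": ratio(tp + tn, tp + fp + fn + tn),
        "sensitivity": sensitivity,
        "specificity": ratio(tn, tn + fp),
        "precision": precision,
        "f1": ratio(2 * precision * sensitivity, precision + sensitivity),
        "unsure_rate": ratio(sum(p == "Unsure" for p in predicted), len(predicted)),
    }


# Made-up toy data, for illustration only (not results from the study).
truth = ["AD", "AD", "AD", "CN", "CN", "CN"]
predicted = ["AD", "AD", "Unsure", "AD", "CN", "Unsure"]

ad_scores = one_vs_rest_metrics(truth, predicted, positive="AD")
cn_scores = one_vs_rest_metrics(truth, predicted, positive="CN")
print(ad_scores)
print(cn_scores)
```

Under this convention, a chatbot that rarely answers "Unsure" and leans toward "AD" would show high AD sensitivity but more false positives among CN speakers, which is the pattern the abstract describes for Bard. A different treatment of "Unsure" (for example, excluding those replies before scoring) would change the numbers.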