Department of Medical Informatics, West China Medical School, Sichuan University, Chengdu, China.
West China College of Stomatology, Sichuan University, Chengdu, China.
J Med Internet Res. 2023 Dec 29;25:e51501. doi: 10.2196/51501.
Artificial intelligence models tailored to diagnosing cognitive impairment have shown excellent results. However, it remains unclear whether large language models can rival specialized models using text alone.
In this study, we explored the performance of ChatGPT for primary screening of mild cognitive impairment (MCI) and standardized the design steps and components of the prompts.
We gathered a total of 174 participants from DementiaBank and assigned 70% of them to the training set and 30% to the test set. Only text dialogues were retained. Sentences were cleaned using a macro code and then checked manually. The prompt consisted of 5 main parts: character setting, scoring system setting, indicator setting, output setting, and explanatory information setting. Three dimensions of variables from published studies were included: vocabulary (ie, word frequency and word ratio, phrase frequency and phrase ratio, and lexical complexity), syntax and grammar (ie, syntactic complexity and grammatical components), and semantics (ie, semantic density and semantic coherence). We used R 4.3.0 for the analysis of variables and diagnostic indicators.
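The 70/30 participant split and the 5-part prompt structure described above can be sketched as follows. This is a minimal illustration only: the section texts, participant IDs, and random seed are hypothetical placeholders, not the study's actual prompt wording or assignment procedure.

```python
import random

# Hypothetical 70/30 split of 174 participants (IDs and seed are
# illustrative; the study's actual assignment procedure is not specified).
def split_participants(ids, train_frac=0.7, seed=0):
    rng = random.Random(seed)
    shuffled = ids[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]

train, test = split_participants(list(range(174)))  # 121 train, 53 test

# Assembling the 5 prompt components in the order listed above.
# The wording of each section is a placeholder, not the published prompt.
PROMPT_SECTIONS = {
    "character": "You are a clinician screening transcripts for signs of MCI.",
    "scoring_system": "Rate each transcript from 0 (normal) to 10 (severe).",
    "indicators": "Consider vocabulary, syntax and grammar, and semantics.",
    "output": "Return a score and a brief rationale.",
    "explanation": "Definitions and examples for each indicator go here.",
}

def build_prompt(sections: dict) -> str:
    order = ["character", "scoring_system", "indicators", "output", "explanation"]
    return "\n\n".join(sections[key] for key in order)

prompt = build_prompt(PROMPT_SECTIONS)
```

Keeping the components as named sections makes it straightforward for clinicians to standardize and revise each part independently, which is the design goal the abstract describes.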
Three additional indicators related to the severity of MCI were incorporated into the final prompt for the model. These indicators were effective in discriminating between MCI and cognitively normal participants: tip-of-the-tongue phenomenon (P<.001), difficulty with complex ideas (P<.001), and memory issues (P<.001). The final GPT-4 model achieved a sensitivity of 0.8636, a specificity of 0.9487, and an area under the curve of 0.9062 on the training set; on the test set, the sensitivity, specificity, and area under the curve reached 0.7727, 0.8333, and 0.8030, respectively.
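The reported AUC values are consistent with the balanced-accuracy formula: for a classifier evaluated at a single operating point, the trapezoidal ROC area reduces to (sensitivity + specificity) / 2. A minimal arithmetic check against the figures above (the helper function name is ours, not the study's):

```python
# For a single-threshold binary classifier, the ROC curve has one
# operating point, so the trapezoidal AUC equals balanced accuracy.
def binary_auc(sensitivity: float, specificity: float) -> float:
    return (sensitivity + specificity) / 2

train_auc = binary_auc(0.8636, 0.9487)  # ~0.9062, the reported training AUC
test_auc = binary_auc(0.7727, 0.8333)   # ~0.8030, the reported test AUC
```

This check suggests the reported AUCs were computed from the single sensitivity/specificity operating point rather than from a continuous score threshold sweep.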
ChatGPT was effective in the primary screening of participants with possible MCI. Improved standardization of prompts by clinicians would also improve the performance of the model. It is important to note that ChatGPT is not a substitute for a clinician making a diagnosis.