Cheng Huai Yong
Minneapolis VA Health Care System, Minneapolis, MN, United States.
JMIR Form Res. 2025 Jan 3;9:e63494. doi: 10.2196/63494.
The increasing use of ChatGPT in clinical practice and medical education necessitates the evaluation of its reliability, particularly in geriatrics.
This study aimed to evaluate ChatGPT's trustworthiness in geriatrics through 3 distinct approaches: assessing its geriatrics attitude, its geriatrics knowledge, and its clinical application to 2 vignettes of geriatric syndromes (polypharmacy and falls).
We used the validated University of California, Los Angeles (UCLA) geriatrics attitude and knowledge instruments to evaluate ChatGPT's geriatrics attitude and knowledge, comparing its performance with results reported in the literature for medical students, residents, and geriatric medicine fellows. We also evaluated ChatGPT's application to 2 vignettes of geriatric syndromes (polypharmacy and falls).
ChatGPT's mean total score on geriatrics attitude was significantly lower than that of trainees (medical students, internal medicine residents, and geriatric medicine fellows; 2.7 vs 3.7 on a scale from 1 to 5, where 1=strongly disagree and 5=strongly agree). Its mean subscore on positive geriatrics attitude was higher than that of medical students, internal medicine residents, and neurologists (4.1 vs 3.7 on a scale from 1 to 5, where a higher score indicates a more positive attitude toward older adults), and its mean subscore on negative geriatrics attitude was lower than that of the trainees and neurologists (1.8 vs 2.8, where a lower subscore indicates a less negative attitude toward aging). On the UCLA geriatrics knowledge test, ChatGPT outperformed all medical students, internal medicine residents, and geriatric medicine fellows from the validated studies (14.7 vs 11.3 on a scale from -18 to +18, where +18 means all questions were answered correctly). In the polypharmacy vignette, ChatGPT demonstrated solid knowledge of potentially inappropriate medications, accurately identifying 7 common potentially inappropriate medications, 5 drug-drug interactions, and 3 drug-disease interactions; however, it missed 5 drug-disease interactions and 1 drug-drug interaction and produced 2 hallucinations. In the fall vignette, ChatGPT answered 3 of 5 pretest questions correctly and 2 partially correctly, identified 6 categories of fall risk, followed fall guidelines correctly, listed 6 key physical examinations, and recommended 6 categories of fall prevention methods.
This study suggests that ChatGPT can be a valuable supplemental tool in geriatrics, offering reliable information with less age bias, robust geriatrics knowledge, and comprehensive recommendations for managing 2 common geriatric syndromes (polypharmacy and falls) that are consistent with evidence from guidelines, systematic reviews, and other types of studies. ChatGPT's potential as an educational and clinical resource could significantly benefit trainees, health care providers, and laypeople. Further research using GPT-4o, larger geriatrics question sets, and more geriatric syndromes is needed to expand and confirm these findings before adopting ChatGPT widely for geriatrics education and practice.