
Evaluating Large Language Models in extracting cognitive exam dates and scores.

Author Information

Zhang Hao, Jethani Neil, Jones Simon, Genes Nicholas, Major Vincent J, Jaffe Ian S, Cardillo Anthony B, Heilenbach Noah, Ali Nadia Fazal, Bonanni Luke J, Clayburn Andrew J, Khera Zain, Sadler Erica C, Prasad Jaideep, Schlacter Jamie, Liu Kevin, Silva Benjamin, Montgomery Sophie, Kim Eric J, Lester Jacob, Hill Theodore M, Avoricani Alba, Chervonski Ethan, Davydov James, Small William, Chakravartty Eesha, Grover Himanshu, Dodson John A, Brody Abraham A, Aphinyanaphongs Yindalon, Masurkar Arjun, Razavian Narges

Affiliations

NYU Grossman School of Medicine, New York, New York, United States of America.

Rory Meyers College of Nursing, New York University, New York, New York, United States of America.

Publication Information

PLOS Digit Health. 2024 Dec 11;3(12):e0000685. doi: 10.1371/journal.pdig.0000685. eCollection 2024 Dec.

Abstract

Ensuring the reliability of Large Language Models (LLMs) in clinical tasks is crucial. Our study assesses two state-of-the-art LLMs (ChatGPT and LlaMA-2) for extracting clinical information, focusing on cognitive tests such as the MMSE and CDR. Our data consisted of 135,307 clinical notes (Jan 12th, 2010 to May 24th, 2023) mentioning MMSE, CDR, or MoCA. After applying inclusion criteria, 34,465 notes remained, of which 765 were processed by ChatGPT (GPT-4) and LlaMA-2, and 22 experts reviewed the responses. ChatGPT successfully extracted MMSE and CDR instances with dates from 742 notes. We used 20 notes for fine-tuning and for training the reviewers. The remaining 722 were assigned to reviewers, with 309 of them assigned to two reviewers simultaneously. Inter-rater agreement (Fleiss' Kappa), precision, recall, true/false negative rates, and accuracy were calculated. Our study follows the TRIPOD reporting guidelines for model validation. For MMSE information extraction, ChatGPT (vs. LlaMA-2) achieved an accuracy of 83% (vs. 66.4%), sensitivity of 89.7% (vs. 69.9%), true-negative rate of 96% (vs. 60.0%), and precision of 82.7% (vs. 62.2%). For CDR, sensitivity and precision were lower overall, with an accuracy of 87.1% (vs. 74.5%), sensitivity of 84.3% (vs. 39.7%), true-negative rate of 99.8% (vs. 98.4%), and precision of 48.3% (vs. 16.1%). We qualitatively evaluated the MMSE errors of ChatGPT and LlaMA-2 on the double-reviewed notes. LlaMA-2's errors included 27 cases of total hallucination, 19 cases of reporting other scores instead of MMSE, 25 missed scores, and 23 cases of reporting only a wrong date. In comparison, ChatGPT's errors included only 3 cases of total hallucination, 17 cases of reporting the wrong test instead of MMSE, and 19 cases of reporting a wrong date. In this diagnostic/prognostic study of ChatGPT and LlaMA-2 for extracting cognitive exam dates and scores from clinical notes, ChatGPT exhibited high accuracy and better performance than LlaMA-2.
The use of LLMs could benefit dementia research and clinical care by identifying patients eligible for treatment initiation or clinical trial enrollment. Rigorous evaluation of LLMs is crucial to understanding their capabilities and limitations.
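The abstract reports accuracy, sensitivity (recall), true-negative rate, and precision for each model. As a minimal sketch of how these metrics relate to confusion-matrix counts for a binary extraction task, the following Python function computes all four; the example counts are hypothetical and are not taken from the study:

```python
def extraction_metrics(tp, fp, tn, fn):
    """Compute accuracy, sensitivity (recall), true-negative rate
    (specificity), and precision from confusion-matrix counts."""
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),          # recall / true-positive rate
        "true_negative_rate": tn / (tn + fp),   # specificity
        "precision": tp / (tp + fp),
    }

# Hypothetical example: 90 correct extractions, 10 false alarms,
# 95 correct rejections, 5 missed scores.
metrics = extraction_metrics(tp=90, fp=10, tn=95, fn=5)
print({name: round(value, 3) for name, value in metrics.items()})
```

Note that precision and sensitivity penalize different error types: the study's qualitative error categories (hallucinated scores vs. missed scores) map roughly onto false positives and false negatives, respectively.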


Fig 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/5575/11634005/837314b3019f/pdig.0000685.g001.jpg
