Zhang Hao, Jethani Neil, Jones Simon, Genes Nicholas, Major Vincent J, Jaffe Ian S, Cardillo Anthony B, Heilenbach Noah, Ali Nadia Fazal, Bonanni Luke J, Clayburn Andrew J, Khera Zain, Sadler Erica C, Prasad Jaideep, Schlacter Jamie, Liu Kevin, Silva Benjamin, Montgomery Sophie, Kim Eric J, Lester Jacob, Hill Theodore M, Avoricani Alba, Chervonski Ethan, Davydov James, Small William, Chakravartty Eesha, Grover Himanshu, Dodson John A, Brody Abraham A, Aphinyanaphongs Yindalon, Masurkar Arjun, Razavian Narges
NYU Grossman School of Medicine.
NYU Rory Meyers College of Nursing, NYU Grossman School of Medicine.
medRxiv. 2024 Feb 13:2023.07.10.23292373. doi: 10.1101/2023.07.10.23292373.
Large language models (LLMs) are increasingly applied to medical tasks, and ensuring their reliability is vital to avoiding false results. Our study assesses two state-of-the-art LLMs (ChatGPT and LlaMA-2) for extracting clinical information, focusing on cognitive tests such as the MMSE and the CDR.
To evaluate the performance of ChatGPT and LlaMA-2 in extracting MMSE and CDR scores, including their associated dates.
Our data consisted of 135,307 clinical notes (Jan 12th, 2010 to May 24th, 2023) mentioning MMSE, CDR, or MoCA. After applying inclusion criteria, 34,465 notes remained, of which 765 were processed by ChatGPT (GPT-4) and LlaMA-2, and 22 experts reviewed the model responses. ChatGPT successfully extracted MMSE and CDR instances with dates from 742 notes. Twenty notes were used for fine-tuning and for training the reviewers. The remaining 722 were assigned to reviewers, 309 of which were each assigned to two reviewers simultaneously. Inter-rater agreement (Fleiss' kappa), precision, recall, true/false-negative rates, and accuracy were calculated. Our study follows the TRIPOD reporting guidelines for model validation.
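The reviewer comparisons above reduce to standard confusion-matrix metrics plus Fleiss' kappa for the doubly-reviewed notes. A minimal sketch of those calculations, using only the standard library (function names and example inputs are illustrative, not from the study):

```python
# Illustrative sketch of the evaluation metrics described above; the
# function names and any example counts are ours, not from the study.

def confusion_metrics(tp, fp, tn, fn):
    """Precision, sensitivity (recall), true-negative rate, and accuracy
    from confusion-matrix counts."""
    return {
        "precision": tp / (tp + fp),
        "sensitivity": tp / (tp + fn),          # recall
        "true_negative_rate": tn / (tn + fp),   # specificity
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
    }

def fleiss_kappa(ratings):
    """Fleiss' kappa for inter-rater agreement.

    `ratings` has one row per item; each row holds the number of raters
    who assigned that item to each category (row sums must be equal).
    """
    n = sum(ratings[0])                     # raters per item
    N = len(ratings)                        # number of items
    k = len(ratings[0])                     # number of categories
    # Overall proportion of assignments to each category.
    p_j = [sum(row[j] for row in ratings) / (N * n) for j in range(k)]
    # Per-item observed agreement among the n raters.
    P_i = [(sum(c * c for c in row) - n) / (n * (n - 1)) for row in ratings]
    P_bar = sum(P_i) / N                    # mean observed agreement
    P_e = sum(p * p for p in p_j)           # expected chance agreement
    return (P_bar - P_e) / (1 - P_e)
```

For example, two reviewers agreeing on every one of the doubly-assigned notes would yield kappa = 1, and `confusion_metrics` reproduces the four headline numbers reported for each model once its TP/FP/TN/FN counts are tallied.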
For MMSE information extraction, ChatGPT (vs. LlaMA-2) achieved an accuracy of 83% (vs. 66.4%), sensitivity of 89.7% (vs. 69.9%), true-negative rate of 96% (vs. 60.0%), and precision of 82.7% (vs. 62.2%). For CDR information extraction, accuracy was 87.1% (vs. 74.5%), sensitivity 84.3% (vs. 39.7%), true-negative rate 99.8% (vs. 98.4%), and precision 48.3% (vs. 16.1%), with precision markedly lower than for MMSE. We qualitatively evaluated the MMSE errors of ChatGPT and LlaMA-2 on the double-reviewed notes. LlaMA-2's errors included 27 cases of total hallucination, 19 cases of reporting another score instead of the MMSE, 25 missed scores, and 23 cases of reporting only a wrong date. By comparison, ChatGPT's errors included only 3 cases of total hallucination, 17 cases of reporting the wrong test instead of the MMSE, and 19 cases of reporting a wrong date.
In this diagnostic/prognostic study of ChatGPT and LlaMA-2 for extracting cognitive exam dates and scores from clinical notes, ChatGPT exhibited high accuracy and outperformed LlaMA-2. The use of LLMs could benefit dementia research and clinical care by identifying patients eligible for treatment initiation or clinical trial enrollment. Rigorous evaluation of LLMs is crucial to understanding their capabilities and limitations.