
Evaluating Large Language Models in extracting cognitive exam dates and scores.

Authors

Zhang Hao, Jethani Neil, Jones Simon, Genes Nicholas, Major Vincent J, Jaffe Ian S, Cardillo Anthony B, Heilenbach Noah, Ali Nadia Fazal, Bonanni Luke J, Clayburn Andrew J, Khera Zain, Sadler Erica C, Prasad Jaideep, Schlacter Jamie, Liu Kevin, Silva Benjamin, Montgomery Sophie, Kim Eric J, Lester Jacob, Hill Theodore M, Avoricani Alba, Chervonski Ethan, Davydov James, Small William, Chakravartty Eesha, Grover Himanshu, Dodson John A, Brody Abraham A, Aphinyanaphongs Yindalon, Masurkar Arjun, Razavian Narges

Affiliations

NYU Grossman School of Medicine, New York, New York, United States of America.

Rory Meyers College of Nursing, New York University, New York, New York, United States of America.

Publication information

PLOS Digit Health. 2024 Dec 11;3(12):e0000685. doi: 10.1371/journal.pdig.0000685. eCollection 2024 Dec.

DOI: 10.1371/journal.pdig.0000685
PMID: 39661652
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11634005/
Abstract

Ensuring the reliability of Large Language Models (LLMs) in clinical tasks is crucial. Our study assesses two state-of-the-art LLMs (ChatGPT and LlaMA-2) for extracting clinical information, focusing on cognitive tests such as MMSE and CDR. Our data consisted of 135,307 clinical notes (Jan 12th, 2010 to May 24th, 2023) mentioning MMSE, CDR, or MoCA. After applying inclusion criteria, 34,465 notes remained, of which 765 were processed with ChatGPT (GPT-4) and LlaMA-2, and 22 experts reviewed the responses. ChatGPT successfully extracted MMSE and CDR instances with dates from 742 notes. We used 20 notes for fine-tuning and training the reviewers. The remaining 722 were assigned to reviewers, with 309 each assigned to two reviewers simultaneously. Inter-rater agreement (Fleiss' Kappa), precision, recall, true/false negative rates, and accuracy were calculated. Our study follows TRIPOD reporting guidelines for model validation. For MMSE information extraction, ChatGPT (vs. LlaMA-2) achieved accuracy of 83% (vs. 66.4%), sensitivity of 89.7% (vs. 69.9%), true-negative rate of 96% (vs. 60.0%), and precision of 82.7% (vs. 62.2%). For CDR, the results were lower overall, with accuracy of 87.1% (vs. 74.5%), sensitivity of 84.3% (vs. 39.7%), true-negative rate of 99.8% (vs. 98.4%), and precision of 48.3% (vs. 16.1%). We qualitatively evaluated the MMSE errors of ChatGPT and LlaMA-2 on double-reviewed notes. LlaMA-2 errors included 27 cases of total hallucination, 19 cases of reporting other scores instead of MMSE, 25 missed scores, and 23 cases of reporting only the wrong date. In comparison, ChatGPT's errors included only 3 cases of total hallucination, 17 cases of reporting the wrong test instead of MMSE, and 19 cases of reporting a wrong date. In this diagnostic/prognostic study of ChatGPT and LlaMA-2 for extracting cognitive exam dates and scores from clinical notes, ChatGPT exhibited high accuracy, with better performance than LlaMA-2.
The use of LLMs could benefit dementia research and clinical care by identifying patients eligible for treatment initiation or clinical trial enrollment. Rigorous evaluation of LLMs is crucial to understanding their capabilities and limitations.
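
The metrics named in the abstract (accuracy, sensitivity, true-negative rate, precision, and Fleiss' Kappa) follow standard textbook definitions. The Python sketch below shows how they might be computed; it is not the authors' code. The function names `confusion_metrics` and `fleiss_kappa` and all counts and labels in the usage section are hypothetical illustrations, not data from the study.

```python
from collections import Counter


def confusion_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Standard binary metrics from confusion-matrix counts (hypothetical helper)."""
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),          # also called recall
        "true_negative_rate": tn / (tn + fp),   # also called specificity
        "precision": tp / (tp + fp),
    }


def fleiss_kappa(ratings: list) -> float:
    """Fleiss' kappa for inter-rater agreement (hypothetical helper).

    ratings[i] is the list of labels that the raters assigned to item i.
    Every item must have the same number of raters (at least two).
    Raises ZeroDivisionError if every rating uses a single category,
    because expected agreement is then 1.
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    category_totals = Counter()
    observed = 0.0
    for item in ratings:
        counts = Counter(item)
        category_totals.update(counts)
        # Fraction of rater pairs that agree on this item.
        agreeing_pairs = sum(c * (c - 1) for c in counts.values())
        observed += agreeing_pairs / (n_raters * (n_raters - 1))
    p_bar = observed / n_items
    # Agreement expected by chance, from the overall category proportions.
    n_ratings = n_items * n_raters
    p_e = sum((t / n_ratings) ** 2 for t in category_totals.values())
    return (p_bar - p_e) / (1 - p_e)


if __name__ == "__main__":
    # Made-up counts for illustration only; not taken from the paper.
    for name, value in confusion_metrics(tp=8, fp=2, tn=85, fn=5).items():
        print(f"{name}: {value:.3f}")

    # Two hypothetical reviewers labelling four model responses.
    toy_ratings = [
        ["correct", "correct"],
        ["correct", "wrong"],
        ["wrong", "wrong"],
        ["correct", "correct"],
    ]
    print(f"Fleiss' kappa: {fleiss_kappa(toy_ratings):.3f}")
```

Precision and sensitivity can diverge sharply when true positives are rare, which is one plausible reading of the CDR results above: a high true-negative rate can coexist with low precision if only a few notes actually contain the score.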


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/5575/11634005/837314b3019f/pdig.0000685.g001.jpg
