


Context Matching is not Reasoning: Assessing Generalized Evaluation of Generative Language Models in Clinical Settings.

Authors

Wen Andrew, Lu Qiuhao, Chuang Yu-Neng, Wang Guanchu, Yuan Jiayi, Zhang Jiamu, Wang Liwei, Fu Sunyang, Miller Kurt D, Jia Heling, Bedrick Steven D, Hersh William R, Roberts Kirk E, Hu Xia, Liu Hongfang

Affiliations

The University of Texas Health Science Center at Houston.

Rice University.

Publication

Res Sq. 2025 Aug 29:rs.3.rs-7325383. doi: 10.21203/rs.3.rs-7325383/v1.

DOI: 10.21203/rs.3.rs-7325383/v1
PMID: 40909787
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12408041/
Abstract

Current discussion surrounding the clinical capabilities of generative language models (GLMs) predominantly centers on multiple-choice question-answering (MCQA) benchmarks derived from clinical licensing examinations. While accepted for human examinees, characteristics unique to GLMs call the validity of such benchmarks into question. Here, we evaluate four benchmarks using eight GLMs, ablating for parameter size and reasoning capability, and use prompt permutation to test three key assumptions that underpin the generalizability of MCQA-based assessment: that knowledge is applied rather than memorized, that semantically consistent prompts yield consistent answers, and that situations with no correct answer can be recognized. Although large models are more resilient to our perturbations than small models, we invalidate all three assumptions globally, with implications for reasoning models. Additionally, small models are prone to answer memorization despite retaining the underlying knowledge. All models fail significantly in null-answer scenarios. We conclude by suggesting several adaptations toward more robust benchmark designs that better reflect real-world conditions.
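The prompt-permutation idea in the abstract — testing whether a model's answer survives a reshuffling of the MCQA options — can be sketched as follows. This is a minimal illustration, not the paper's implementation: `ask_model`, the toy models, and the scoring function are all hypothetical stand-ins.

```python
import itertools

def permutation_consistency(question, options, ask_model):
    """Return the fraction of option orderings on which the model picks
    the same underlying answer text (1.0 = fully order-invariant)."""
    picks = []
    for perm in itertools.permutations(options):
        labels = "ABCDE"[:len(perm)]
        prompt = question + "\n" + "\n".join(
            f"{label}. {opt}" for label, opt in zip(labels, perm))
        answer = ask_model(prompt)                 # model returns a letter
        picks.append(perm[labels.index(answer)])   # map letter back to text
    most_common = max(set(picks), key=picks.count)
    return picks.count(most_common) / len(picks)

# A toy model with pure position bias: it always answers "B" no matter
# what the options say; the metric scores it at chance level.
position_biased = lambda prompt: "B"

# A toy model that answers by content: it finds the option whose text is
# "4" and returns that option's letter, so it is order-invariant.
def content_based(prompt):
    for line in prompt.splitlines()[1:]:
        if line.endswith(" 4"):
            return line[0]

biased_score = permutation_consistency("2+2=?", ["3", "4", "5"], position_biased)
robust_score = permutation_consistency("2+2=?", ["3", "4", "5"], content_based)
```

A memorizing or position-biased model scores near chance under this check even when its unpermuted accuracy is high, which is the kind of failure the paper's perturbations are designed to expose.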


Figures:
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/b56d/12408041/861ad841b300/nihpp-rs7325383v1-f0001.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/b56d/12408041/3737286d9322/nihpp-rs7325383v1-f0002.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/b56d/12408041/80ad69dd9bcc/nihpp-rs7325383v1-f0003.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/b56d/12408041/7fd0530ffb32/nihpp-rs7325383v1-f0004.jpg

