Wen Andrew, Lu Qiuhao, Chuang Yu-Neng, Wang Guanchu, Yuan Jiayi, Zhang Jiamu, Wang Liwei, Fu Sunyang, Miller Kurt D, Jia Heling, Bedrick Steven D, Hersh William R, Roberts Kirk E, Hu Xia, Liu Hongfang
The University of Texas Health Science Center at Houston.
Rice University.
Res Sq. 2025 Aug 29:rs.3.rs-7325383. doi: 10.21203/rs.3.rs-7325383/v1.
Current discussion surrounding the clinical capabilities of generative language models (GLMs) predominantly centers on multiple-choice question-answering (MCQA) benchmarks derived from clinical licensing examinations. While such exams are accepted for human examinees, characteristics unique to GLMs call the validity of these benchmarks into question. Here, we evaluate four benchmarks using eight GLMs, ablating for parameter size and reasoning capability, and use prompt permutation to test three key assumptions that underpin the generalizability of MCQA-based assessments: that knowledge is applied rather than memorized, that semantically equivalent prompts yield consistent answers, and that questions with no correct answer can be recognized. Although large models are more resilient to our perturbations than small models, we find that these assumptions do not hold overall, with implications for reasoning models. Additionally, despite retaining the underlying knowledge, small models are prone to answer memorization. All models exhibit significant failure in null-answer scenarios. We conclude by suggesting several adaptations for more robust benchmark designs that better reflect real-world conditions.
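To make the prompt-permutation checks described above concrete, the following is a minimal, hypothetical sketch of two of the three perturbation types: shuffling answer options to probe memorization versus applied knowledge, and removing the keyed answer to create a null-answer scenario. The example item, field names, and prompt format are illustrative assumptions and do not reflect the authors' actual benchmarks or implementation; the semantic-consistency check (paraphrasing the question stem) is omitted because it requires a separate rewriting step.

```python
import random

# Hypothetical MCQA item; field names and content are assumptions for illustration only.
item = {
    "stem": ("A 54-year-old patient presents with crushing substernal chest pain. "
             "Which test should be ordered first?"),
    "options": ["ECG", "Chest X-ray", "D-dimer", "Echocardiogram"],
    "answer": "ECG",
}

def shuffle_options(item, seed=0):
    """Permute option order so a memorized letter no longer maps to the keyed answer."""
    opts = item["options"][:]
    random.Random(seed).shuffle(opts)
    return {**item, "options": opts}

def drop_correct_option(item, none_label="None of the above"):
    """Remove the keyed answer to create a null-answer scenario the model should recognize."""
    opts = [o for o in item["options"] if o != item["answer"]] + [none_label]
    return {**item, "options": opts, "answer": none_label}

def to_prompt(item):
    """Render the item as a lettered multiple-choice prompt."""
    labels = "ABCDEFGH"
    lines = [item["stem"]] + [f"{labels[i]}. {o}" for i, o in enumerate(item["options"])]
    return "\n".join(lines + ["Answer with a single letter."])

print(to_prompt(shuffle_options(item)))
print()
print(to_prompt(drop_correct_option(item)))
```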