Yadav Shashank, Maughan David, Subbian Vignesh
College of Engineering, The University of Arizona, Tucson, AZ.
ArXiv. 2025 Jul 30:arXiv:2507.23146v1.
Although computational phenotyping is a central informatics activity, with the resulting cohorts supporting a wide variety of applications, it is time-intensive because of manual data review. We previously assessed the ability of large language models (LLMs) to perform computational phenotyping tasks using computable phenotypes for acute respiratory failure (ARF) respiratory support therapies. The models successfully performed concept classification and classification of single-therapy phenotypes but underperformed on multiple-therapy phenotypes. To understand issues with these complex tasks, we expanded PHEONA, a generalizable framework for evaluating LLMs, to include methods specifically for evaluating faulty reasoning.
We assessed the responses of three lightweight LLMs (DeepSeek-r1, 32 billion parameters; Mistral Small, 24 billion parameters; and Phi-4, 14 billion parameters), both with and without prompt modifications, to identify explanation correctness errors and unfaithfulness errors in phenotyping.
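To make the evaluation design and the two error categories concrete, here is a minimal sketch of how such a loop might look. This is not the PHEONA implementation from the paper: the model tags, the `query_model` stub, and the two error checks are hypothetical placeholders, and the error definitions reflect one plausible reading of "explanation correctness" and "unfaithfulness".

```python
# Illustrative sketch only; not the authors' PHEONA pipeline.
# `query_model` is a hypothetical stub for a local inference backend
# serving the three lightweight models evaluated in the paper.
from dataclasses import dataclass

MODELS = ["deepseek-r1:32b", "mistral-small:24b", "phi-4:14b"]

@dataclass
class PhenotypeResponse:
    label: str        # therapy phenotype the model assigned
    explanation: str  # the model's stated reasoning for that label

def query_model(model: str, prompt: str, modified: bool = False) -> PhenotypeResponse:
    """Hypothetical stub; replace with a real LLM call."""
    return PhenotypeResponse(label="IMV", explanation="The record mentions a ventilator ...")

def correctness_error(resp: PhenotypeResponse, record_facts: set[str]) -> bool:
    """Explanation correctness error: the explanation misstates the source data.
    Placeholder check; real adjudication requires reviewing the explanation."""
    return not any(fact in resp.explanation for fact in record_facts)

def unfaithfulness_error(resp: PhenotypeResponse) -> bool:
    """Unfaithfulness error: the explanation does not actually support the label.
    Placeholder check; real adjudication requires reviewing the reasoning."""
    return resp.label not in resp.explanation

# Tally both error types per model, with and without a prompt modification.
for model in MODELS:
    for modified in (False, True):
        resp = query_model(model, "Classify the respiratory support therapy ...", modified)
        flags = {
            "correctness_error": correctness_error(resp, {"ventilator", "intubation"}),
            "unfaithfulness_error": unfaithfulness_error(resp),
        }
        print(model, "modified" if modified else "baseline", flags)
```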
For experiments without prompt modifications, both error types were present across all models, although more responses contained explanation correctness errors than unfaithfulness errors. For experiments assessing the impact of prompt modifications on accuracy, DeepSeek, a reasoning model, showed the smallest overall accuracy impact compared with Mistral and Phi.
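Here "accuracy impact" can be read as the change in classification accuracy when the modified prompt replaces the baseline prompt. A tiny illustration follows; the numbers are invented placeholders, not results reported in the paper.

```python
# "Accuracy impact" as a before/after delta. All figures below are
# invented placeholders for illustration, not the paper's results.
baseline_acc = {"deepseek-r1": 0.90, "mistral-small": 0.90, "phi-4": 0.90}
modified_acc = {"deepseek-r1": 0.88, "mistral-small": 0.78, "phi-4": 0.80}

for model, before in baseline_acc.items():
    print(f"{model}: accuracy impact = {modified_acc[model] - before:+.2f}")
```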
Since reasoning errors were ubiquitous across models, our enhancement of PHEONA with a component for assessing faulty reasoning provides critical support for LLM evaluation and evidence of reasoning errors in complex tasks. While insights from reasoning errors can inform prompt refinement, a deeper understanding of why LLM reasoning errors occur will likely require further development and refinement of interpretability methods.
Reasoning errors were pervasive across LLM responses for computational phenotyping, a complex reasoning task.