
Lightweight Language Models are Prone to Reasoning Errors for Complex Computational Phenotyping Tasks.

Authors

Yadav Shashank, Maughan David, Subbian Vignesh

Affiliation

College of Engineering, The University of Arizona, Tucson, AZ.

Publication

ArXiv. 2025 Jul 30:arXiv:2507.23146v1.

Abstract

OBJECTIVE

Although computational phenotyping is a central informatics activity whose resulting cohorts support a wide variety of applications, it is time-intensive because of manual data review. We previously assessed the ability of large language models (LLMs) to perform computational phenotyping tasks using computable phenotypes for acute respiratory failure (ARF) respiratory support therapies. The models successfully performed concept classification and classification of single-therapy phenotypes but underperformed on multiple-therapy phenotypes. To understand issues with these complex tasks, we expanded PHEONA, a generalizable framework for evaluating LLMs, to include methods specifically for evaluating faulty reasoning.

MATERIALS AND METHODS

We assessed responses from three lightweight LLMs (DeepSeek-r1 at 32 billion parameters, Mistral Small at 24 billion, and Phi-4 at 14 billion), both with and without prompt modifications, to identify explanation correctness errors and unfaithfulness errors in phenotyping.
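
As a rough illustration of this evaluation setup, the sketch below tallies the two error types per model and prompt condition. This is not PHEONA's actual implementation; the Response fields, model labels, and judgment values are hypothetical assumptions for illustration only.

    # Hypothetical sketch of the error tally described above; all names
    # and data are illustrative assumptions, not code from PHEONA.
    from collections import Counter
    from dataclasses import dataclass

    @dataclass
    class Response:
        model: str                 # e.g. "DeepSeek-r1-32B" (assumed label)
        prompt_modified: bool      # whether the prompt included modifications
        explanation_correct: bool  # judged correctness of the explanation
        faithful: bool             # whether the reasoning supports the answer

    def tally_errors(responses):
        """Count each error type per (model, prompt condition)."""
        counts = Counter()
        for r in responses:
            if not r.explanation_correct:
                counts[(r.model, r.prompt_modified, "explanation_correctness")] += 1
            if not r.faithful:
                counts[(r.model, r.prompt_modified, "unfaithfulness")] += 1
        return counts

    # Toy usage with fabricated judgments, purely to show the mechanics:
    demo = [
        Response("DeepSeek-r1-32B", False, False, True),
        Response("Mistral-Small-24B", False, True, False),
        Response("Phi-4-14B", True, False, False),
    ]
    for key, n in sorted(tally_errors(demo).items()):
        print(key, n)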

RESULTS

For experiments without prompt modifications, both error types were present across all models, although more responses had explanation correctness errors than unfaithfulness errors. For experiments assessing the accuracy impact of prompt modifications, DeepSeek, a reasoning model, showed the smallest overall accuracy impact compared to Mistral and Phi.

DISCUSSION

Since reasoning errors were ubiquitous across models, our enhancement of PHEONA with a component for assessing faulty reasoning provides critical support for LLM evaluation and evidence of reasoning errors on complex tasks. While insights from reasoning errors can inform prompt refinement, a deeper understanding of why LLM reasoning errors occur will likely require further development and refinement of interpretability methods.

CONCLUSION

Reasoning errors were pervasive across LLM responses for computational phenotyping, a complex reasoning task.


Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/5bdf/12324558/2fb0619c5923/nihpp-2507.23146v1-f0001.jpg
