


Lightweight Language Models are Prone to Reasoning Errors for Complex Computational Phenotyping Tasks.

Authors

Yadav Shashank, Maughan David, Subbian Vignesh

Affiliations

College of Engineering, The University of Arizona, Tucson, AZ.

Publication

ArXiv. 2025 Jul 30:arXiv:2507.23146v1.

PMID: 40766892
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12324558/
Abstract

OBJECTIVE

Although computational phenotyping is a central informatics activity whose resulting cohorts support a wide variety of applications, it is time-intensive because of manual data review. We previously assessed the ability of LLMs to perform computational phenotyping tasks using computable phenotypes for acute respiratory failure (ARF) respiratory support therapies. The models successfully performed concept classification and classification of single-therapy phenotypes, but underperformed on multiple-therapy phenotypes. To understand issues with these complex tasks, we expanded PHEONA, a generalizable framework for evaluation of LLMs, to include methods specifically for evaluating faulty reasoning.

MATERIALS AND METHODS

We assessed the responses of three lightweight LLMs (DeepSeek-r1 32 billion, Mistral Small 24 billion, and Phi-4 14 billion) both with and without prompt modifications to identify explanation correctness and unfaithfulness errors for phenotyping.

RESULTS

For experiments without prompt modifications, both errors were present across all models although more responses had explanation correctness errors than unfaithfulness errors. For experiments assessing accuracy impact after prompt modifications, DeepSeek, a reasoning model, had the smallest overall accuracy impact when compared to Mistral and Phi.

DISCUSSION

Since reasoning errors were ubiquitous across models, our enhancement of PHEONA to include a component for assessing faulty reasoning provides critical support for LLM evaluation and evidence for reasoning errors for complex tasks. While insights from reasoning errors can help prompt refinement, a deeper understanding of why LLM reasoning errors occur will likely require further development and refinement of interpretability methods.

CONCLUSION

Reasoning errors were pervasive across LLM responses for computational phenotyping, a complex reasoning task.


Figures
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/5bdf/12324558/2fb0619c5923/nihpp-2507.23146v1-f0001.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/5bdf/12324558/7fca124480fc/nihpp-2507.23146v1-f0002.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/5bdf/12324558/d71868fee891/nihpp-2507.23146v1-f0003.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/5bdf/12324558/8ad71e597382/nihpp-2507.23146v1-f0004.jpg

Similar Articles

1
SHREC: A Framework for Advancing Next-Generation Computational Phenotyping with Large Language Models.
ArXiv. 2025 Jul 17:arXiv:2506.16359v3.
2
Evaluating the Reasoning Capabilities of Large Language Models for Medical Coding and Hospital Readmission Risk Stratification: Zero-Shot Prompting Approach.
J Med Internet Res. 2025 Jul 30;27:e74142. doi: 10.2196/74142.
3
Signs and symptoms to determine if a patient presenting in primary care or hospital outpatient settings has COVID-19.
Cochrane Database Syst Rev. 2022 May 20;5(5):CD013665. doi: 10.1002/14651858.CD013665.pub3.
4
Sexual Harassment and Prevention Training
5
Evaluating large language model performance to support the diagnosis and management of patients with primary immune disorders.
J Allergy Clin Immunol. 2025 Feb 14. doi: 10.1016/j.jaci.2025.02.004.
6
Improving Large Language Models' Summarization Accuracy by Adding Highlights to Discharge Notes: Comparative Evaluation.
JMIR Med Inform. 2025 Jul 24;13:e66476. doi: 10.2196/66476.
7
Performance of ChatGPT-4o and Four Open-Source Large Language Models in Generating Diagnoses Based on China's Rare Disease Catalog: Comparative Study.
J Med Internet Res. 2025 Jun 18;27:e69929. doi: 10.2196/69929.
8
Medical reasoning in LLMs: an in-depth analysis of DeepSeek R1.
Front Artif Intell. 2025 Jun 18;8:1616145. doi: 10.3389/frai.2025.1616145. eCollection 2025.
9
Survivor, family and professional experiences of psychosocial interventions for sexual abuse and violence: a qualitative evidence synthesis.
Cochrane Database Syst Rev. 2022 Oct 4;10(10):CD013648. doi: 10.1002/14651858.CD013648.pub2.

References Cited in This Article

1
Towards automated phenotype definition extraction using large language models.
Genomics Inform. 2024 Oct 31;22(1):21. doi: 10.1186/s44342-024-00023-2.
2
A general framework for developing computable clinical phenotype algorithms.
J Am Med Inform Assoc. 2024 Aug 1;31(8):1785-1796. doi: 10.1093/jamia/ocae121.
3
Rule-Based Cohort Definitions for Acute Respiratory Failure: Electronic Phenotyping Algorithm.
JMIR Med Inform. 2020 Apr 15;8(4):e18402. doi: 10.2196/18402.
4
Making work visible for electronic phenotype implementation: Lessons learned from the eMERGE network.
J Biomed Inform. 2019 Nov;99:103293. doi: 10.1016/j.jbi.2019.103293. Epub 2019 Sep 19.
5
Advances in Electronic Phenotyping: From Rule-Based Definitions to Machine Learning Models.
Annu Rev Biomed Data Sci. 2018 Jul;1:53-68. doi: 10.1146/annurev-biodatasci-080917-013315. Epub 2018 May 23.
6
The eICU Collaborative Research Database, a freely available multi-center database for critical care research.
Sci Data. 2018 Sep 11;5:180178. doi: 10.1038/sdata.2018.178.