
Lightweight Language Models are Prone to Reasoning Errors for Complex Computational Phenotyping Tasks.

Authors

Yadav Shashank, Maughan David, Subbian Vignesh

Affiliation

College of Engineering, The University of Arizona, Tucson, AZ.

Publication

ArXiv. 2025 Jul 30:arXiv:2507.23146v1.

Abstract

OBJECTIVE

Although computational phenotyping is a central informatics activity whose resulting cohorts support a wide variety of applications, it is time-intensive because of manual data review. We previously assessed the ability of large language models (LLMs) to perform computational phenotyping tasks using computable phenotypes for acute respiratory failure (ARF) respiratory support therapies. The models successfully performed concept classification and classification of single-therapy phenotypes but underperformed on multiple-therapy phenotypes. To understand issues with these complex tasks, we expanded PHEONA, a generalizable framework for evaluating LLMs, to include methods specifically for evaluating faulty reasoning.

MATERIALS AND METHODS

We assessed responses from three lightweight LLMs (DeepSeek-r1 at 32 billion parameters, Mistral Small at 24 billion, and Phi-4 at 14 billion), both with and without prompt modifications, to identify explanation correctness errors and unfaithfulness errors in phenotyping.
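
As a rough illustration of this evaluation setup, the sketch below tallies the two error types per model and prompt condition. This is not PHEONA's actual implementation; the Response fields, model labels, and judgment values are hypothetical assumptions for illustration only.

    # Hypothetical sketch of the error tally described above; all names
    # and data are illustrative assumptions, not code from PHEONA.
    from collections import Counter
    from dataclasses import dataclass

    @dataclass
    class Response:
        model: str                 # e.g. "DeepSeek-r1-32B" (assumed label)
        prompt_modified: bool      # whether the prompt included modifications
        explanation_correct: bool  # judged correctness of the explanation
        faithful: bool             # whether the reasoning supports the answer

    def tally_errors(responses):
        """Count each error type per (model, prompt condition)."""
        counts = Counter()
        for r in responses:
            if not r.explanation_correct:
                counts[(r.model, r.prompt_modified, "explanation_correctness")] += 1
            if not r.faithful:
                counts[(r.model, r.prompt_modified, "unfaithfulness")] += 1
        return counts

    # Toy usage with fabricated judgments, purely to show the mechanics:
    demo = [
        Response("DeepSeek-r1-32B", False, False, True),
        Response("Mistral-Small-24B", False, True, False),
        Response("Phi-4-14B", True, False, False),
    ]
    for key, n in sorted(tally_errors(demo).items()):
        print(key, n)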

RESULTS

For experiments without prompt modifications, both error types were present across all models, although more responses had explanation correctness errors than unfaithfulness errors. For experiments assessing the accuracy impact of prompt modifications, DeepSeek, a reasoning model, showed the smallest overall accuracy impact compared to Mistral and Phi.

DISCUSSION

Since reasoning errors were ubiquitous across models, our enhancement of PHEONA with a component for assessing faulty reasoning provides critical support for LLM evaluation and evidence of reasoning errors on complex tasks. While insights from reasoning errors can inform prompt refinement, a deeper understanding of why LLM reasoning errors occur will likely require further development and refinement of interpretability methods.

CONCLUSION

Reasoning errors were pervasive across LLM responses for computational phenotyping, a complex reasoning task.


Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/5bdf/12324558/2fb0619c5923/nihpp-2507.23146v1-f0001.jpg
