Can large language models reason about medical questions?

Author information

Liévin Valentin, Hother Christoffer Egeberg, Motzfeldt Andreas Geert, Winther Ole

Affiliations

Section for Cognitive Systems, Technical University of Denmark, Anker Engelunds Vej 101, 2800 Kongens Lyngby, Denmark.

FindZebra, Rådvadsvej 36, 2400 Copenhagen, Denmark.

Publication information

Patterns (N Y). 2024 Mar 1;5(3):100943. doi: 10.1016/j.patter.2024.100943. eCollection 2024 Mar 8.

Abstract

Although large language models often produce impressive outputs, it remains unclear how they perform in real-world scenarios requiring strong reasoning skills and expert domain knowledge. We set out to investigate whether closed- and open-source models (GPT-3.5, Llama 2, etc.) can be applied to answer and reason about difficult real-world-based questions. We focus on three popular medical benchmarks (MedQA-US Medical Licensing Examination [USMLE], MedMCQA, and PubMedQA) and multiple prompting scenarios: chain of thought (CoT; think step by step), few shot, and retrieval augmentation. Based on an expert annotation of the generated CoTs, we found that InstructGPT can often read, reason, and recall expert knowledge. Last, by leveraging advances in prompt engineering (few-shot and ensemble methods), we demonstrated that GPT-3.5 not only yields calibrated predictive distributions but also reaches the passing score on three datasets: MedQA-USMLE (60.2%), MedMCQA (62.7%), and PubMedQA (78.2%). Open-source models are closing the gap: Llama 2 70B also passed the MedQA-USMLE with 62.5% accuracy.
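
The abstract does not spell out how the ensemble step works, so the following is only a minimal sketch under stated assumptions: several chain-of-thought completions are sampled per question, a single answer letter is extracted from each, and the vote shares are read as a predictive distribution over the answer options. The function name `ensemble_vote` and the uniform fallback are hypothetical choices made for illustration, not the authors' implementation.

```python
from collections import Counter


def ensemble_vote(sampled_answers, options):
    """Aggregate answers extracted from several sampled CoTs.

    Hypothetical helper: returns the majority answer and the vote
    shares, which can be read as a distribution over the options.
    """
    # Ignore anything that is not a valid option (e.g. a failed extraction).
    counts = Counter(a for a in sampled_answers if a in options)
    total = sum(counts.values())
    if total == 0:
        # Assumption: fall back to a uniform distribution when no vote is valid.
        probs = {o: 1.0 / len(options) for o in options}
    else:
        probs = {o: counts[o] / total for o in options}
    # Ties are broken in favor of the option listed first.
    prediction = max(options, key=lambda o: probs[o])
    return prediction, probs


# Five sampled CoTs for one four-option question.
prediction, probs = ensemble_vote(["A", "C", "A", "B", "A"], ["A", "B", "C", "D"])
print(prediction, probs)
```

Comparing such vote shares with the observed accuracy at each confidence level is one way to check whether the resulting predictions are calibrated.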

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f405/10935498/7cac6340ebf3/fx1.jpg
