
Evaluation of large language models as a diagnostic tool for medical learners and clinicians using advanced prompting techniques.

Authors

Gaebe Karolina, van der Woerd Benjamin

Affiliations

Division of Otolaryngology-Head and Neck Surgery, Department of Surgery - Michael G. DeGroote School of Medicine, Hamilton, Ontario, Canada.

Publication

PLoS One. 2025 Aug 1;20(8):e0325803. doi: 10.1371/journal.pone.0325803. eCollection 2025.

Abstract

BACKGROUND

Large language models (LLMs) have demonstrated capabilities in natural language processing and critical reasoning. Studies investigating their potential use as healthcare diagnostic tools have largely relied on proprietary models like ChatGPT and have not explored the application of advanced prompt engineering techniques. This study aims to evaluate the diagnostic accuracy of three open-source LLMs and the role of prompt engineering using clinical scenarios.

METHODS

We analyzed the performance of three open-source LLMs (llama-3.1-70b-versatile, llama-3.1-8b-instant, and mixtral-8x7b-32768) using advanced prompt engineering when answering Medscape Clinical Challenge questions. Responses were recorded and evaluated for correctness, accuracy, precision, specificity, and sensitivity. A sensitivity analysis was conducted in which the three LLMs were presented with the challenge questions using basic prompting, and in which cases containing visual assets were excluded. Results were compared with previously published performance data for GPT-3.5.
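The abstract does not specify which advanced prompting techniques were used; role prompting, step-by-step reasoning instructions, and a fixed answer format are common choices. A minimal sketch contrasting a basic prompt with a hypothetical advanced prompt for a multiple-choice clinical case might look like this (the case text, option labels, and prompt wording are all illustrative assumptions, not the authors' prompts):

```python
def basic_prompt(case: str, options: list[str]) -> str:
    """Basic prompting: the bare case vignette plus lettered answer choices."""
    choices = "\n".join(f"{chr(65 + i)}. {o}" for i, o in enumerate(options))
    return f"{case}\n\nWhich is the most likely diagnosis?\n{choices}"


def advanced_prompt(case: str, options: list[str]) -> str:
    """Advanced prompting (assumed): clinician role, explicit step-by-step
    reasoning instruction, and a constrained final-answer format."""
    return (
        "You are an experienced attending physician.\n"
        "Reason step by step: summarize the key findings, weigh each answer "
        "option against them, then commit to a single best answer.\n\n"
        + basic_prompt(case, options)
        + "\n\nEnd your response with: 'Final answer: <letter>'."
    )
```

Either string would then be sent as the user message to a chat-completion endpoint serving the model under test; the constrained final line makes automated scoring of the letter answer straightforward.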

RESULTS

Llama-3.1-70b-versatile, llama-3.1-8b-instant, and mixtral-8x7b-32768 achieved correct responses in 79%, 65%, and 62% of cases, respectively, with llama-3.1-70b-versatile outperforming GPT-3.5 (74%). Diagnostic accuracy, precision, sensitivity, and specificity all exceeded the values previously reported for GPT-3.5. Results generated using advanced prompting strategies were superior to those based on basic prompting. Sensitivity analysis revealed similar trends when cases with visual assets were excluded.
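The abstract does not state how precision, sensitivity, and specificity were aggregated across the multiple-choice answer options; a common convention is one-vs-rest per option with macro-averaging. A minimal sketch under that assumption (the averaging scheme and the example data are assumptions, not the study's method or results):

```python
def confusion(y_true, y_pred, label):
    """One-vs-rest confusion counts for a single answer label."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    tn = len(y_true) - tp - fp - fn
    return tp, fp, fn, tn


def metrics(y_true, y_pred):
    """Overall accuracy plus macro-averaged precision, sensitivity
    (recall), and specificity across all answer labels."""
    labels = sorted(set(y_true))
    prec = sens = spec = 0.0
    for lab in labels:
        tp, fp, fn, tn = confusion(y_true, y_pred, lab)
        prec += tp / (tp + fp) if tp + fp else 0.0
        sens += tp / (tp + fn) if tp + fn else 0.0
        spec += tn / (tn + fp) if tn + fp else 0.0
    n = len(labels)
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    return {"accuracy": accuracy, "precision": prec / n,
            "sensitivity": sens / n, "specificity": spec / n}
```

Here `y_true` would hold the correct option letters for each challenge case and `y_pred` the letters extracted from each model's response.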

DISCUSSION

Using advanced prompting techniques, LLMs can generate clinically accurate responses. The study highlights the limitations of proprietary models like ChatGPT, particularly in terms of accessibility and reproducibility due to version deprecation. Future research should employ prompt engineering techniques and prioritize the use of open-source models to ensure research replicability.

