Gaebe Karolina, van der Woerd Benjamin
Division of Otolaryngology-Head and Neck Surgery, Department of Surgery - Michael G. DeGroote School of Medicine, Hamilton, Ontario, Canada.
PLoS One. 2025 Aug 1;20(8):e0325803. doi: 10.1371/journal.pone.0325803. eCollection 2025.
Large language models (LLMs) have demonstrated capabilities in natural language processing and critical reasoning. Studies investigating their potential use as healthcare diagnostic tools have largely relied on proprietary models like ChatGPT and have not explored the application of advanced prompt engineering techniques. This study aims to evaluate, using clinical scenarios, the diagnostic accuracy of three open-source LLMs and the role of prompt engineering.
We analyzed the performance of three open-source LLMs (llama-3.1-70b-versatile, llama-3.1-8b-instant, and mixtral-8x7b-32768) using advanced prompt engineering to answer Medscape Clinical Challenge questions. Responses were recorded and evaluated for correctness, accuracy, precision, specificity, and sensitivity. A sensitivity analysis was conducted in which the three LLMs received the challenge questions with only basic prompting and in which cases with visual assets were excluded. Results were compared with previously published performance data for GPT-3.5.
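The abstract does not reproduce the prompt templates or the API used to serve the listed models, so the following Python sketch is only illustrative of the basic-versus-advanced prompting contrast described above. The names MODELS, basic_prompt, advanced_prompt, and query_model are assumptions introduced here, not the study's code; query_model is a hypothetical stand-in for whichever chat-completions endpoint hosts these model identifiers.

```python
# Illustrative sketch (not the study's code): a plain prompt versus an
# "advanced" prompt combining a clinician persona, step-by-step reasoning,
# and a constrained answer format for a multiple-choice clinical vignette.

MODELS = ["llama-3.1-70b-versatile", "llama-3.1-8b-instant", "mixtral-8x7b-32768"]

def basic_prompt(vignette: str, options: list[str]) -> list[dict]:
    """Basic prompting: the case and answer options with no added structure."""
    choices = "\n".join(f"{chr(65 + i)}. {o}" for i, o in enumerate(options))
    return [{"role": "user",
             "content": f"{vignette}\n\n{choices}\n\nWhich is the most likely diagnosis?"}]

def advanced_prompt(vignette: str, options: list[str]) -> list[dict]:
    """One possible 'advanced' prompt: persona, explicit reasoning
    instruction, and a fixed final-answer format."""
    choices = "\n".join(f"{chr(65 + i)}. {o}" for i, o in enumerate(options))
    return [
        {"role": "system",
         "content": ("You are an experienced attending physician. Reason through "
                     "the case step by step, weighing each option, before answering.")},
        {"role": "user",
         "content": (f"Clinical case:\n{vignette}\n\nOptions:\n{choices}\n\n"
                     "Work through the differential, then end with 'Final answer: <letter>'.")},
    ]

def query_model(model_id: str, messages: list[dict]) -> str:
    # Hypothetical transport layer: plug in the chat-completions API that
    # serves the model identifiers above.
    raise NotImplementedError
```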
Llama-3.1-70b-versatile, llama-3.1-8b-instant, and mixtral-8x7b-32768 answered 79%, 65%, and 62% of cases correctly, respectively, with llama-3.1-70b-versatile outperforming GPT-3.5 (74%). Diagnostic accuracy, precision, sensitivity, and specificity all exceeded the values previously reported for GPT-3.5. Responses generated with advanced prompting strategies were superior to those generated with basic prompting. The sensitivity analysis showed similar trends when cases with visual assets were excluded.
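The abstract does not state how precision, sensitivity, and specificity were tabulated for multiple-choice answers; the sketch below assumes a macro-averaged one-vs-rest tabulation per answer option, which may differ from the paper's method. The function name confusion_metrics and the example data are illustrative only.

```python
# Minimal sketch: confusion-matrix metrics for multiple-choice answers,
# assuming a macro-averaged one-vs-rest tabulation per answer option.

def confusion_metrics(predicted: list[str], correct: list[str]) -> dict[str, float]:
    """Accuracy plus macro-averaged precision, sensitivity, and specificity."""
    options = sorted(set(correct) | set(predicted))
    precision, sensitivity, specificity = [], [], []
    for opt in options:
        tp = sum(p == opt and c == opt for p, c in zip(predicted, correct))
        fp = sum(p == opt and c != opt for p, c in zip(predicted, correct))
        fn = sum(p != opt and c == opt for p, c in zip(predicted, correct))
        tn = sum(p != opt and c != opt for p, c in zip(predicted, correct))
        precision.append(tp / (tp + fp) if tp + fp else 0.0)
        sensitivity.append(tp / (tp + fn) if tp + fn else 0.0)
        specificity.append(tn / (tn + fp) if tn + fp else 0.0)
    return {
        "accuracy": sum(p == c for p, c in zip(predicted, correct)) / len(correct),
        "precision": sum(precision) / len(options),
        "sensitivity": sum(sensitivity) / len(options),
        "specificity": sum(specificity) / len(options),
    }

# Example with four questions and options A-D (illustrative data only).
print(confusion_metrics(predicted=["A", "C", "B", "D"], correct=["A", "B", "B", "D"]))
```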
When prompted with advanced techniques, open-source LLMs can generate clinically accurate responses. The study highlights the limitations of proprietary models like ChatGPT, particularly in terms of accessibility and reproducibility due to version deprecation. Future research should employ prompt engineering techniques and prioritize the use of open-source models to ensure research replicability.