Gaebe Karolina, van der Woerd Benjamin
Division of Otolaryngology-Head and Neck Surgery, Department of Surgery - Michael G. DeGroote School of Medicine, Hamilton, Ontario, Canada.
PLoS One. 2025 Aug 1;20(8):e0325803. doi: 10.1371/journal.pone.0325803. eCollection 2025.
Large language models (LLMs) have demonstrated capabilities in natural language processing and critical reasoning. Studies investigating their potential use as healthcare diagnostic tools have largely relied on proprietary models like ChatGPT and have not explored the application of advanced prompt engineering techniques. This study aims to evaluate, using clinical scenarios, the diagnostic accuracy of three open-source LLMs and the role of prompt engineering.
We analyzed the performance of three open-source LLMs (llama-3.1-70b-versatile, llama-3.1-8b-instant, and mixtral-8x7b-32768) using advanced prompt engineering to answer Medscape Clinical Challenge questions. Responses were recorded and evaluated for correctness, accuracy, precision, specificity, and sensitivity. A sensitivity analysis was conducted in which the three LLMs received the challenge questions with only basic prompting and in which cases with visual assets were excluded. Results were compared with previously published performance data for GPT-3.5.
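The abstract does not reproduce the prompt templates or the API used to serve the listed models, so the following Python sketch is only illustrative of the basic-versus-advanced prompting contrast described above. The names MODELS, basic_prompt, advanced_prompt, and query_model are assumptions introduced here, not the study's code; query_model is a hypothetical stand-in for whichever chat-completions endpoint hosts these model identifiers.

```python
# Illustrative sketch (not the study's code): a plain prompt versus an
# "advanced" prompt combining a clinician persona, step-by-step reasoning,
# and a constrained answer format for a multiple-choice clinical vignette.

MODELS = ["llama-3.1-70b-versatile", "llama-3.1-8b-instant", "mixtral-8x7b-32768"]

def basic_prompt(vignette: str, options: list[str]) -> list[dict]:
    """Basic prompting: the case and answer options with no added structure."""
    choices = "\n".join(f"{chr(65 + i)}. {o}" for i, o in enumerate(options))
    return [{"role": "user",
             "content": f"{vignette}\n\n{choices}\n\nWhich is the most likely diagnosis?"}]

def advanced_prompt(vignette: str, options: list[str]) -> list[dict]:
    """One possible 'advanced' prompt: persona, explicit reasoning
    instruction, and a fixed final-answer format."""
    choices = "\n".join(f"{chr(65 + i)}. {o}" for i, o in enumerate(options))
    return [
        {"role": "system",
         "content": ("You are an experienced attending physician. Reason through "
                     "the case step by step, weighing each option, before answering.")},
        {"role": "user",
         "content": (f"Clinical case:\n{vignette}\n\nOptions:\n{choices}\n\n"
                     "Work through the differential, then end with 'Final answer: <letter>'.")},
    ]

def query_model(model_id: str, messages: list[dict]) -> str:
    # Hypothetical transport layer: plug in the chat-completions API that
    # serves the model identifiers above.
    raise NotImplementedError
```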
Llama-3.1-70b-versatile, llama-3.1-8b-instant, and mixtral-8x7b-32768 answered 79%, 65%, and 62% of cases correctly, respectively, with llama-3.1-70b-versatile outperforming GPT-3.5 (74%). Diagnostic accuracy, precision, sensitivity, and specificity all exceeded the values previously reported for GPT-3.5. Responses generated with advanced prompting strategies were superior to those generated with basic prompting. The sensitivity analysis showed similar trends when cases with visual assets were excluded.
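The abstract does not state how precision, sensitivity, and specificity were tabulated for multiple-choice answers; the sketch below assumes a macro-averaged one-vs-rest tabulation per answer option, which may differ from the paper's method. The function name confusion_metrics and the example data are illustrative only.

```python
# Minimal sketch: confusion-matrix metrics for multiple-choice answers,
# assuming a macro-averaged one-vs-rest tabulation per answer option.

def confusion_metrics(predicted: list[str], correct: list[str]) -> dict[str, float]:
    """Accuracy plus macro-averaged precision, sensitivity, and specificity."""
    options = sorted(set(correct) | set(predicted))
    precision, sensitivity, specificity = [], [], []
    for opt in options:
        tp = sum(p == opt and c == opt for p, c in zip(predicted, correct))
        fp = sum(p == opt and c != opt for p, c in zip(predicted, correct))
        fn = sum(p != opt and c == opt for p, c in zip(predicted, correct))
        tn = sum(p != opt and c != opt for p, c in zip(predicted, correct))
        precision.append(tp / (tp + fp) if tp + fp else 0.0)
        sensitivity.append(tp / (tp + fn) if tp + fn else 0.0)
        specificity.append(tn / (tn + fp) if tn + fp else 0.0)
    return {
        "accuracy": sum(p == c for p, c in zip(predicted, correct)) / len(correct),
        "precision": sum(precision) / len(options),
        "sensitivity": sum(sensitivity) / len(options),
        "specificity": sum(specificity) / len(options),
    }

# Example with four questions and options A-D (illustrative data only).
print(confusion_metrics(predicted=["A", "C", "B", "D"], correct=["A", "B", "B", "D"]))
```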
When prompted with advanced techniques, open-source LLMs can generate clinically accurate responses. The study highlights the limitations of proprietary models like ChatGPT, particularly in terms of accessibility and reproducibility due to version deprecation. Future research should employ prompt engineering techniques and prioritize the use of open-source models to ensure research replicability.