
Evaluation of large language models as a diagnostic tool for medical learners and clinicians using advanced prompting techniques.

Author Information

Gaebe Karolina, van der Woerd Benjamin

Affiliation

Division of Otolaryngology-Head and Neck Surgery, Department of Surgery - Michael G. DeGroote School of Medicine, Hamilton, Ontario, Canada.

Publication Information

PLoS One. 2025 Aug 1;20(8):e0325803. doi: 10.1371/journal.pone.0325803. eCollection 2025.

Abstract

BACKGROUND

Large language models (LLMs) have demonstrated capabilities in natural language processing and critical reasoning. Studies investigating their potential use as healthcare diagnostic tools have largely relied on proprietary models like ChatGPT and have not explored the application of advanced prompt engineering techniques. This study aims to evaluate the diagnostic accuracy of three open-source LLMs and the role of prompt engineering using clinical scenarios.

METHODS

We analyzed the performance of three open-source LLMs (llama-3.1-70b-versatile, llama-3.1-8b-instant, and mixtral-8x7b-32768) using advanced prompt engineering when answering Medscape Clinical Challenge questions. Responses were recorded and evaluated for correctness, accuracy, precision, specificity, and sensitivity. A sensitivity analysis was conducted by presenting the three LLMs with the challenge questions under basic prompting and by excluding cases with visual assets. Results were compared with previously published performance data on GPT-3.5.
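The abstract does not reproduce the study's actual prompt templates, but the basic-versus-advanced distinction can be illustrated. The sketch below is an assumption about what such prompts might look like: the role framing, step-by-step instruction, and output format are hypothetical, not the authors' templates.

```python
# Illustrative sketch of basic vs. advanced prompting for a multiple-choice
# clinical vignette. The templates here are hypothetical examples, not the
# prompts used in the study.

def basic_prompt(vignette: str, options: list[str]) -> str:
    """Basic prompting: the vignette and answer choices, nothing else."""
    choices = "\n".join(f"{chr(65 + i)}. {o}" for i, o in enumerate(options))
    return f"{vignette}\n\n{choices}\n\nAnswer:"

def advanced_prompt(vignette: str, options: list[str]) -> str:
    """Advanced prompting: adds a clinician role, an instruction to reason
    step by step, and a fixed final-answer format for reliable scoring."""
    choices = "\n".join(f"{chr(65 + i)}. {o}" for i, o in enumerate(options))
    return (
        "You are an experienced clinician working through a diagnostic case.\n"
        "Reason step by step: summarize the key findings, weigh each option "
        "against them, then commit to a single best answer.\n\n"
        f"Case: {vignette}\n\nOptions:\n{choices}\n\n"
        "End your response with 'Final answer: <letter>'."
    )
```

Either prompt string would then be sent to the model under test; the fixed "Final answer" format in the advanced variant also makes automated extraction of the chosen option straightforward.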

RESULTS

Llama-3.1-70b-versatile, llama-3.1-8b-instant, and mixtral-8x7b-32768 achieved correct responses in 79%, 65%, and 62% of cases, respectively, with llama-3.1-70b-versatile outperforming GPT-3.5 (74%). Diagnostic accuracy, precision, sensitivity, and specificity all exceeded the values previously reported for GPT-3.5. Results generated using advanced prompting strategies were superior to those based on basic prompting. Sensitivity analysis revealed similar trends when cases with visual assets were excluded.

DISCUSSION

Using advanced prompting techniques, LLMs can generate clinically accurate responses. The study highlights the limitations of proprietary models like ChatGPT, particularly in terms of accessibility and reproducibility due to version deprecation. Future research should employ prompt engineering techniques and prioritize the use of open-source models to ensure research replicability.



