Pangaea Data Limited, London, SE1 7LY, United Kingdom.
Data Science Institute, Imperial College London, London, SW7 2AZ, United Kingdom.
J Am Med Inform Assoc. 2024 Sep 1;31(9):1884-1891. doi: 10.1093/jamia/ocae184.
This study aims to evaluate the utility of large language models (LLMs) in healthcare, focusing on their applications in enhancing patient care through improved diagnostic and decision-making processes, and as ancillary tools for healthcare professionals.
We evaluated ChatGPT, GPT-4, and LLaMA3 in identifying patients with specific diseases using gold-labeled Electronic Health Records (EHRs) from the MIMIC-III database, covering two prevalent diseases, Chronic Obstructive Pulmonary Disease (COPD) and Chronic Kidney Disease (CKD), along with the rare condition Primary Biliary Cirrhosis (PBC) and the hard-to-diagnose condition cancer cachexia.
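As a point of reference, the sketch below shows how such a patient-identification query might be issued to one of these models. It is a minimal illustration only, assuming the OpenAI chat completions API; the prompt wording, YES/NO parsing, and function name are assumptions for illustration, not the authors' actual pipeline.

```python
# Illustrative sketch: querying an LLM to flag a disease in an EHR note.
# Assumes the OpenAI chat completions client; prompt text and parsing
# are hypothetical, not taken from the study.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def identify_disease(note_text: str, disease: str) -> bool:
    """Ask the model whether an EHR note indicates the given disease."""
    response = client.chat.completions.create(
        model="gpt-4",
        temperature=0,  # deterministic output for evaluation runs
        messages=[
            {"role": "system",
             "content": "You are a clinical assistant. Answer only YES or NO."},
            {"role": "user",
             "content": f"Does the following EHR note indicate that the "
                        f"patient has {disease}?\n\n{note_text}"},
        ],
    )
    answer = response.choices[0].message.content.strip().upper()
    return answer.startswith("YES")


# Predictions would then be compared against the gold labels, e.g.:
# predictions = [identify_disease(n, "Chronic Kidney Disease") for n in notes]
```

Few-shot prompting, as evaluated in the study, would prepend labeled example notes to the messages list before the query.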
In patient identification, GPT-4 performed comparably to or better than the corresponding disease-specific machine learning models (F1-score ≥ 85%) on COPD, CKD, and PBC. GPT-4 excelled in the PBC use case, achieving a 4.23% higher F1-score than the disease-specific "Traditional Machine Learning" models. ChatGPT and LLaMA3 performed worse than GPT-4 across all diseases and almost all metrics. Few-shot prompts helped ChatGPT, GPT-4, and LLaMA3 achieve higher precision and specificity, but at the cost of lower sensitivity and negative predictive value (NPV).
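All of the metrics reported above derive from the four confusion-matrix counts. A self-contained sketch of their definitions follows; the label lists are toy data invented purely for illustration.

```python
# Confusion-matrix metrics used in the evaluation: precision, sensitivity
# (recall), specificity, negative predictive value (NPV), and F1-score.

def evaluate(gold: list[bool], predicted: list[bool]) -> dict[str, float]:
    tp = sum(g and p for g, p in zip(gold, predicted))          # true positives
    fp = sum(not g and p for g, p in zip(gold, predicted))      # false positives
    fn = sum(g and not p for g, p in zip(gold, predicted))      # false negatives
    tn = sum(not g and not p for g, p in zip(gold, predicted))  # true negatives
    precision = tp / (tp + fp)      # PPV: fraction of positive calls that are right
    sensitivity = tp / (tp + fn)    # recall: fraction of true cases found
    specificity = tn / (tn + fp)    # fraction of non-cases correctly rejected
    npv = tn / (tn + fn)            # fraction of negative calls that are right
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {"precision": precision, "sensitivity": sensitivity,
            "specificity": specificity, "NPV": npv, "F1": f1}


# Toy example (invented labels):
print(evaluate(gold=[True, True, False, False, True, False],
               predicted=[True, False, False, True, True, False]))
```

These definitions make the reported trade-off concrete: a few-shot prompt that makes the model more conservative about positive calls reduces false positives (raising precision and specificity) while missing more true cases (lowering sensitivity and NPV).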
The study highlights both the potential and the limitations of LLMs in healthcare. Issues with errors, limited explanatory capability, and ethical concerns such as data privacy and model transparency suggest that these models should serve as supplementary tools in clinical settings. Future studies should improve training datasets and model designs so that LLMs gain better utility in healthcare.
The study shows that LLMs have the potential to assist clinicians with tasks such as patient identification, but false positives and false negatives must be mitigated before LLMs are adequate for real-world clinical assistance.