Reese Justin T, Danis Daniel, Caufield J Harry, Groza Tudor, Casiraghi Elena, Valentini Giorgio, Mungall Christopher J, Robinson Peter N
Division of Environmental Genomics and Systems Biology, Lawrence Berkeley National Laboratory, Berkeley, CA, 94720, USA.
The Jackson Laboratory for Genomic Medicine, Farmington CT, 06032, USA.
medRxiv. 2024 Feb 26:2023.07.13.23292613. doi: 10.1101/2023.07.13.23292613.
Large language models such as GPT-4 have previously been applied to differential diagnostic challenges based on published case reports. Published case reports have a sophisticated narrative style that is not readily available from typical electronic health records (EHRs). Furthermore, even if such a narrative were available in EHRs, privacy requirements would preclude sending it outside the hospital firewall. We therefore tested a method for parsing clinical texts to extract ontology terms and programmatically generating prompts that by design are free of protected health information.
We investigated different methods to prepare prompts from 75 recently published case reports. We transformed the original narratives by extracting structured terms representing phenotypic abnormalities, comorbidities, treatments, and laboratory tests and creating prompts programmatically.
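The programmatic prompt generation described above can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the template wording, the function name `build_prompt`, and the example term labels are hypothetical and are not the authors' actual implementation; the key idea shown is that only structured ontology term labels, never free narrative text, enter the prompt.

```python
# Hypothetical sketch of prompt assembly from structured ontology terms
# (e.g., HPO phenotype labels). Because only curated term labels are
# interpolated into a fixed template, no free-text PHI reaches the prompt.

def build_prompt(phenotypes, excluded=None):
    """Assemble a templated diagnostic prompt from ontology term labels."""
    excluded = excluded or []
    lines = [
        "The patient presents with the following findings:",
    ]
    lines += [f"- {label}" for label in phenotypes]
    if excluded:
        lines.append("The following findings were explicitly excluded:")
        lines += [f"- {label}" for label in excluded]
    lines.append("Provide a ranked differential diagnosis.")
    return "\n".join(lines)

prompt = build_prompt(
    ["Seizure", "Global developmental delay", "Hypotonia"],
    excluded=["Hepatomegaly"],
)
print(prompt)
```

In a design like this, comorbidities, treatments, and laboratory tests would be added as further templated sections in the same way.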
Performance of all of these approaches was modest, with the correct diagnosis ranked first in only 5.3-17.6% of cases. The performance of the prompts created from structured data was substantially worse than that of the original narrative texts, even if additional information was added following manual review of term extraction. Moreover, different versions of GPT-4 demonstrated substantially different performance on this task.
The sensitivity of the performance to the form of the prompt and the instability of results over two GPT-4 versions represent important current limitations to the use of GPT-4 to support diagnosis in real-life clinical settings.
Research is needed to identify the best methods for creating prompts from typically available clinical data to support differential diagnostics.