Reese Justin T, Danis Daniel, Caufield J Harry, Groza Tudor, Casiraghi Elena, Valentini Giorgio, Mungall Christopher J, Robinson Peter N
Division of Environmental Genomics and Systems Biology, Lawrence Berkeley National Laboratory, Berkeley, CA, 94720, USA.
The Jackson Laboratory for Genomic Medicine, Farmington CT, 06032, USA.
medRxiv. 2024 Feb 26:2023.07.13.23292613. doi: 10.1101/2023.07.13.23292613.
Large language models such as GPT-4 have previously been applied to differential diagnostic challenges based on published case reports. Published case reports have a sophisticated narrative style that is not readily available from typical electronic health records (EHRs). Furthermore, even if such a narrative were available in EHRs, privacy requirements would preclude sending it outside the hospital firewall. We therefore tested a method for parsing clinical texts to extract ontology terms and programmatically generating prompts that by design are free of protected health information.
We investigated different methods to prepare prompts from 75 recently published case reports. We transformed the original narratives by extracting structured terms representing phenotypic abnormalities, comorbidities, treatments, and laboratory tests and creating prompts programmatically.
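The programmatic prompt generation described above can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the template wording, the function name `build_prompt`, and the example term labels are hypothetical and are not the authors' actual implementation; the key idea shown is that only structured ontology term labels, never free narrative text, enter the prompt.

```python
# Hypothetical sketch of prompt assembly from structured ontology terms
# (e.g., HPO phenotype labels). Because only curated term labels are
# interpolated into a fixed template, no free-text PHI reaches the prompt.

def build_prompt(phenotypes, excluded=None):
    """Assemble a templated diagnostic prompt from ontology term labels."""
    excluded = excluded or []
    lines = [
        "The patient presents with the following findings:",
    ]
    lines += [f"- {label}" for label in phenotypes]
    if excluded:
        lines.append("The following findings were explicitly excluded:")
        lines += [f"- {label}" for label in excluded]
    lines.append("Provide a ranked differential diagnosis.")
    return "\n".join(lines)

prompt = build_prompt(
    ["Seizure", "Global developmental delay", "Hypotonia"],
    excluded=["Hepatomegaly"],
)
print(prompt)
```

In a design like this, comorbidities, treatments, and laboratory tests would be added as further templated sections in the same way.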
Performance of all of these approaches was modest, with the correct diagnosis ranked first in only 5.3-17.6% of cases. The performance of the prompts created from structured data was substantially worse than that of the original narrative texts, even if additional information was added following manual review of term extraction. Moreover, different versions of GPT-4 demonstrated substantially different performance on this task.
The sensitivity of the performance to the form of the prompt and the instability of results over two GPT-4 versions represent important current limitations to the use of GPT-4 to support diagnosis in real-life clinical settings.
Research is needed to identify the best methods for creating prompts from typically available clinical data to support differential diagnostics.