Reese Justin T, Chimirri Leonardo, Bridges Yasemin, Danis Daniel, Caufield J Harry, Wissink Kyran, McMurry Julie A, Graefe Adam SL, Casiraghi Elena, Valentini Giorgio, Jacobsen Julius OB, Haendel Melissa, Smedley Damian, Mungall Christopher J, Robinson Peter N
Division of Environmental Genomics and Systems Biology, Lawrence Berkeley National Laboratory, Berkeley, CA, USA.
Monarch Initiative.
medRxiv. 2024 Nov 7:2024.07.22.24310816. doi: 10.1101/2024.07.22.24310816.
Large language models (LLMs) show promise in supporting differential diagnosis, but their performance is challenging to evaluate due to the unstructured nature of their responses. To assess the current capabilities of LLMs to diagnose genetic diseases, we benchmarked these models on 5,213 case reports using the Phenopacket Schema, the Human Phenotype Ontology and Mondo disease ontology. Prompts generated from each phenopacket were sent to three generative pretrained transformer (GPT) models. The same phenopackets were used as input to a widely used diagnostic tool, Exomiser, in phenotype-only mode. The best LLM ranked the correct diagnosis first in 23.6% of cases, whereas Exomiser did so in 35.5% of cases. While the performance of LLMs for supporting differential diagnosis has been improving, it has not reached the level of commonly used traditional bioinformatics tools. Future research is needed to determine the best approach to incorporate LLMs into diagnostic pipelines.
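The headline figures above (correct diagnosis ranked first in 23.6% versus 35.5% of cases) are top-1 accuracies over ranked differential diagnoses. The following is a minimal sketch of how such a metric could be computed, assuming each tool's output has already been mapped to a ranked list of Mondo identifiers per case. The function names, the data layout, and the identifiers are hypothetical placeholders for illustration; they are not taken from the study's code or data.

```python
from typing import Dict, List, Optional


def rank_of_correct(ranked_ids: List[str], correct_id: str) -> Optional[int]:
    """Return the 1-based rank of correct_id in ranked_ids, or None if absent."""
    for position, candidate in enumerate(ranked_ids, start=1):
        if candidate == correct_id:
            return position
    return None


def top_k_accuracy(
    predictions: Dict[str, List[str]],
    truth: Dict[str, str],
    k: int = 1,
) -> float:
    """Fraction of cases whose correct diagnosis appears within the top k ranks.

    Cases with no prediction at all count as misses.
    """
    if not truth:
        raise ValueError("no cases to score")
    hits = 0
    for case_id, correct_id in truth.items():
        rank = rank_of_correct(predictions.get(case_id, []), correct_id)
        if rank is not None and rank <= k:
            hits += 1
    return hits / len(truth)


# Toy data: placeholder identifiers, not real Mondo terms or study results.
truth = {
    "case1": "MONDO:9000001",
    "case2": "MONDO:9000002",
    "case3": "MONDO:9000003",
    "case4": "MONDO:9000004",
}
predictions = {
    "case1": ["MONDO:9000001", "MONDO:9000009"],  # correct at rank 1
    "case2": ["MONDO:9000009", "MONDO:9000002"],  # correct at rank 2
    "case3": ["MONDO:9000008"],                   # correct diagnosis missing
    "case4": [],                                  # no usable answer
}

print(top_k_accuracy(predictions, truth, k=1))  # 0.25
print(top_k_accuracy(predictions, truth, k=2))  # 0.5
```

Scoring on ontology identifiers rather than free text is what makes the comparison between an LLM and a tool such as Exomiser possible at all: both outputs reduce to the same kind of ranked list, so the same function can score either one.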