Systematic benchmarking demonstrates large language models have not reached the diagnostic accuracy of traditional rare-disease decision support tools.

Authors

Reese Justin T, Chimirri Leonardo, Bridges Yasemin, Danis Daniel, Caufield J Harry, Wissink Kyran, McMurry Julie A, Graefe Adam SL, Casiraghi Elena, Valentini Giorgio, Jacobsen Julius OB, Haendel Melissa, Smedley Damian, Mungall Christopher J, Robinson Peter N

Affiliations

Division of Environmental Genomics and Systems Biology, Lawrence Berkeley National Laboratory, Berkeley, CA, USA.

Monarch Initiative.

Publication information

medRxiv. 2024 Nov 7:2024.07.22.24310816. doi: 10.1101/2024.07.22.24310816.

DOI:10.1101/2024.07.22.24310816

PMID:39108510

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11302616/

Abstract

Large language models (LLMs) show promise in supporting differential diagnosis, but their performance is challenging to evaluate due to the unstructured nature of their responses. To assess the current capabilities of LLMs to diagnose genetic diseases, we benchmarked these models on 5,213 case reports using the Phenopacket Schema, the Human Phenotype Ontology and Mondo disease ontology. Prompts generated from each phenopacket were sent to three generative pretrained transformer (GPT) models. The same phenopackets were used as input to a widely used diagnostic tool, Exomiser, in phenotype-only mode. The best LLM ranked the correct diagnosis first in 23.6% of cases, whereas Exomiser did so in 35.5% of cases. While the performance of LLMs for supporting differential diagnosis has been improving, it has not reached the level of commonly used traditional bioinformatics tools. Future research is needed to determine the best approach to incorporate LLMs into diagnostic pipelines.
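The headline comparison in the abstract (correct diagnosis ranked first in 23.6% of cases for the best LLM versus 35.5% for Exomiser) is a top-1 accuracy over ranked candidate lists. The sketch below shows how such a metric could be computed once each tool's output has been reduced to an ordered list of disease identifiers per case. It is an illustration under stated assumptions, not the authors' evaluation code: the function name `top_k_accuracy`, the case IDs, and the disease IDs are all made up, and the step of grounding free-text LLM answers to ontology identifiers is assumed to have happened already.

```python
from typing import Dict, List


def top_k_accuracy(
    rankings: Dict[str, List[str]],
    truth: Dict[str, str],
    k: int = 1,
) -> float:
    """Fraction of cases whose correct disease ID is among the first k candidates.

    rankings: case ID -> ranked candidate disease IDs (best first).
    truth:    case ID -> correct disease ID.
    A case with no ranking at all counts as a miss.
    """
    if not truth:
        raise ValueError("no cases to score")
    hits = sum(
        1
        for case_id, correct in truth.items()
        if correct in rankings.get(case_id, [])[:k]
    )
    return hits / len(truth)


# Toy data with placeholder identifiers (hypothetical, not real Mondo IDs).
truth = {"c1": "D1", "c2": "D2", "c3": "D3", "c4": "D4"}

# Hypothetical ranked output of an LLM after grounding to disease IDs.
llm_ranking = {"c1": ["D1", "D9"], "c2": ["D7", "D2"], "c3": ["D8"], "c4": []}

# Hypothetical ranked output of a phenotype-matching tool; c4 returned nothing.
tool_ranking = {"c1": ["D1"], "c2": ["D2", "D5"], "c3": ["D6", "D3"]}

llm_top1 = top_k_accuracy(llm_ranking, truth, k=1)
tool_top1 = top_k_accuracy(tool_ranking, truth, k=1)
print(f"LLM top-1:  {llm_top1:.1%}")
print(f"Tool top-1: {tool_top1:.1%}")
```

Scoring every tool against the same set of cases and the same identifier space keeps the percentages directly comparable. Raising `k` gives the looser top-k variants of the same metric.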

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/c28d/11563241/53d6a99cf512/nihpp-2024.07.22.24310816v2-f0001.jpg
