Systematic benchmarking demonstrates large language models have not reached the diagnostic accuracy of traditional rare-disease decision support tools.

Authors

Reese Justin T, Chimirri Leonardo, Bridges Yasemin, Danis Daniel, Caufield J Harry, Wissink Kyran, McMurry Julie A, Graefe Adam SL, Casiraghi Elena, Valentini Giorgio, Jacobsen Julius OB, Haendel Melissa, Smedley Damian, Mungall Christopher J, Robinson Peter N

Affiliations

Division of Environmental Genomics and Systems Biology, Lawrence Berkeley National Laboratory, Berkeley, CA, USA.

Monarch Initiative.

Publication information

medRxiv. 2024 Nov 7:2024.07.22.24310816. doi: 10.1101/2024.07.22.24310816.


PMID: 39108510
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11302616/

Abstract

Large language models (LLMs) show promise in supporting differential diagnosis, but their performance is challenging to evaluate due to the unstructured nature of their responses. To assess the current capabilities of LLMs to diagnose genetic diseases, we benchmarked these models on 5,213 case reports using the Phenopacket Schema, the Human Phenotype Ontology and Mondo disease ontology. Prompts generated from each phenopacket were sent to three generative pretrained transformer (GPT) models. The same phenopackets were used as input to a widely used diagnostic tool, Exomiser, in phenotype-only mode. The best LLM ranked the correct diagnosis first in 23.6% of cases, whereas Exomiser did so in 35.5% of cases. While the performance of LLMs for supporting differential diagnosis has been improving, it has not reached the level of commonly used traditional bioinformatics tools. Future research is needed to determine the best approach to incorporate LLMs into diagnostic pipelines.
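
The headline comparison (correct diagnosis ranked first in 23.6% of cases for the best LLM versus 35.5% for Exomiser) is a top-1 accuracy over ranked candidate lists. The following is a minimal sketch of how such a top-k metric could be computed; it is not the authors' evaluation code. The function names, the data layout, and the placeholder disease identifiers are assumptions made for illustration, and the toy cases are invented rather than taken from the study.

```python
from typing import List, Optional, Sequence, Tuple

# Hypothetical layout: each case pairs a ranked list of candidate
# diagnoses (best first) with the identifier of the correct diagnosis.
Case = Tuple[Sequence[str], str]


def rank_of_correct(ranked: Sequence[str], correct: str) -> Optional[int]:
    """Return the 1-based rank of the correct diagnosis, or None if absent."""
    for position, candidate in enumerate(ranked, start=1):
        if candidate == correct:
            return position
    return None


def top_k_accuracy(cases: List[Case], k: int) -> float:
    """Fraction of cases whose correct diagnosis appears within the top k."""
    if not cases:
        return 0.0
    hits = 0
    for ranked, correct in cases:
        rank = rank_of_correct(ranked, correct)
        if rank is not None and rank <= k:
            hits += 1
    return hits / len(cases)


# Invented toy data with placeholder identifiers (not real ontology terms).
toy_cases: List[Case] = [
    (["DISEASE_A", "DISEASE_B", "DISEASE_C"], "DISEASE_A"),  # correct at rank 1
    (["DISEASE_B", "DISEASE_A"], "DISEASE_A"),               # correct at rank 2
    (["DISEASE_C", "DISEASE_D"], "DISEASE_A"),               # correct answer missing
    (["DISEASE_E"], "DISEASE_E"),                            # correct at rank 1
]

top1 = top_k_accuracy(toy_cases, k=1)
top3 = top_k_accuracy(toy_cases, k=3)
print(f"top-1 accuracy: {top1:.2f}")
print(f"top-3 accuracy: {top3:.2f}")
```

In a setting like the one described, both tools would be scored on the same cases, with candidate diagnoses mapped to a shared disease ontology before matching so that the comparison does not depend on free-text wording.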

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/c28d/11563241/97d27b104952/nihpp-2024.07.22.24310816v2-f0002.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/c28d/11563241/53d6a99cf512/nihpp-2024.07.22.24310816v2-f0001.jpg
