
Performance of ChatGPT-4o and Four Open-Source Large Language Models in Generating Diagnoses Based on China's Rare Disease Catalog: Comparative Study.

Author Information

Zhong Wei, Liu YiFan, Liu Yan, Yang Kai, Gao HuiMin, Yan HuiHui, Hao WenJing, Yan YouSheng, Yin ChengHong

Affiliations

Department of Prenatal Diagnosis, Beijing Obstetrics and Gynecology Hospital, Capital Medical University, Beijing Maternal and Child Health Care Hospital, No. 251 Yaojiayuan Road, Chaoyang District, Beijing, China (phone: +86 18810963279).

Publication Information

J Med Internet Res. 2025 Jun 18;27:e69929. doi: 10.2196/69929.

Abstract

BACKGROUND

Diagnosing rare diseases remains challenging due to their inherent complexity and limited physician knowledge. Large language models (LLMs) offer new potential to enhance diagnostic workflows.

OBJECTIVE

This study aimed to evaluate the diagnostic accuracy of ChatGPT-4o and 4 open-source LLMs (qwen2.5:7b, Llama3.1:8b, qwen2.5:72b, and Llama3.1:70b) for rare diseases, assess the effect of language on diagnostic performance, and explore retrieval-augmented generation (RAG) and chain-of-thought (CoT) reasoning.

METHODS

We extracted the clinical manifestations of 121 rare diseases from China's inaugural rare disease catalog. ChatGPT-4o generated a primary diagnosis and 5 differential diagnoses for each case, while the 4 open-source LLMs were assessed in both English and Chinese contexts. The lowest-performing model underwent re-evaluation with RAG and CoT. Diagnostic accuracy was compared via the McNemar test. A survey evaluated 11 clinicians' familiarity with rare diseases.
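
The study does not publish its analysis code, but the paired comparison it describes can be reproduced with a standard McNemar test. The Python sketch below is a minimal illustration; the table counts are placeholders, not the study's data.

    # McNemar test on paired diagnostic outcomes over the same set of cases.
    # All counts below are illustrative placeholders, not the study's data.
    from statsmodels.stats.contingency_tables import mcnemar

    # Paired 2x2 table for two models judged on the same 121 cases:
    # rows = model A (correct, incorrect); cols = model B (correct, incorrect)
    table = [[52, 30],
             [10, 29]]

    # exact=False uses the chi-square approximation, matching the
    # chi-square statistics (with 1 df) reported in the abstract.
    result = mcnemar(table, exact=False, correction=True)
    print(f"chi2(1) = {result.statistic:.2f}, P = {result.pvalue:.3f}")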

RESULTS

ChatGPT-4o demonstrated the highest diagnostic accuracy at 90.1%. Language effects varied across models: qwen2.5:7b showed comparable performance in Chinese (51.2%) and English (47.9%; χ²₁=0.32, P=.57), whereas Llama3.1:8b exhibited significantly higher English accuracy (67.8% vs 31.4%; χ²₁=40.20, P<.001). Among the larger models, qwen2.5:72b maintained cross-lingual consistency (Chinese: 82.6% vs English: 83.5%; odds ratio [OR] 0.88, 95% CI 0.27-2.76, P=1.000), contrasting with Llama3.1:70b's language-dependent variation (Chinese: 80.2% vs English: 90.1%; OR 0.29, 95% CI 0.08-0.83, P=.02). Cross-model comparisons revealed that Llama3.1:8b underperformed qwen2.5:7b in Chinese (χ²₁=13.22, P<.001) but surpassed it in English (χ²₁=13.92, P<.001). No significant differences were observed between qwen2.5:72b and Llama3.1:70b (English: OR 0.33, P=.08; Chinese: OR 1.5, 95% CI 0.48-5.12, P=.07); qwen2.5:72b matched ChatGPT-4o's performance in both languages (English: OR 0.33, P=.08; Chinese: OR 0.44, P=.09); and Llama3.1:70b mirrored ChatGPT-4o's English accuracy (OR 1, P=1.000) but lagged in Chinese (OR 0.33, P=.02). RAG implementation raised qwen2.5:7b's accuracy to 79.3% (χ²₁=31.11, P<.001), with 85.9% retrieval precision. The distilled model DeepSeek-R1:7b markedly underperformed (9.9% vs qwen2.5:7b; χ²₁=42.19, P<.001). Clinician surveys revealed significant knowledge gaps in rare disease management.
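
The paired odds ratios quoted above can be derived from the discordant pairs of the same 2×2 table. The sketch below shows the standard conditional (matched-pairs) estimate with a Wald-type 95% CI, again on placeholder counts; the authors' exact procedure may differ (for example, exact intervals for sparse tables, which would explain P values of exactly 1.000).

    # Conditional (matched-pairs) odds ratio from discordant pairs,
    # with a Wald-type 95% CI on the log scale. Placeholder counts;
    # the paper's exact method (e.g., exact CIs) may differ.
    import math

    b = 4   # cases model A got right and model B got wrong
    c = 12  # cases model A got wrong and model B got right

    odds_ratio = b / c
    se_log_or = math.sqrt(1 / b + 1 / c)
    lo = math.exp(math.log(odds_ratio) - 1.96 * se_log_or)
    hi = math.exp(math.log(odds_ratio) + 1.96 * se_log_or)
    print(f"OR = {odds_ratio:.2f}, 95% CI {lo:.2f}-{hi:.2f}")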

CONCLUSIONS

ChatGPT-4o demonstrated superior diagnostic performance for rare diseases. While Llama3.1:8b is viable for localized deployment in resource-constrained English diagnostic workflows, Chinese applications require larger models to achieve comparable diagnostic accuracy. The need for such validation is heightened by the release of open-source models like DeepSeek-R1, which may see rapid adoption without thorough evaluation. Successful clinical implementation of LLMs hinges on 3 core elements: model parameterization, user language, and pretraining data. Integrating RAG significantly enhanced open-source LLM accuracy for rare disease diagnosis, although caution remains warranted for low-parameter reasoning models, which showed substantial performance limitations. We recommend that hospital IT departments and policymakers prioritize language relevance in model selection and consider integrating RAG with curated knowledge bases to enhance diagnostic utility in constrained settings, while exercising caution with low-parameter models.
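
To give a concrete sense of what "integrating RAG with a curated knowledge base" could look like in such a constrained setting, here is a minimal sketch. The TF-IDF retriever, the sample catalog entries, and the query_llm() stub are all assumptions for illustration; the paper does not specify its retrieval pipeline, only that RAG over the catalog raised qwen2.5:7b's accuracy.

    # Minimal RAG sketch: retrieve catalog entries relevant to the input
    # manifestations, then prompt a locally deployed LLM with that context.
    # The TF-IDF retriever, sample entries, and query_llm() stub are
    # illustrative assumptions; the paper does not detail its pipeline.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    knowledge_base = [
        "Fabry disease: angiokeratomas, acroparesthesias, hypohidrosis, corneal opacities",
        "Wilson disease: Kayser-Fleischer rings, liver dysfunction, movement disorder",
        # ... one curated entry per disease in the rare disease catalog
    ]

    def retrieve(query: str, k: int = 3) -> list[str]:
        """Return the k catalog entries most similar to the query."""
        vectorizer = TfidfVectorizer()
        matrix = vectorizer.fit_transform(knowledge_base + [query])
        sims = cosine_similarity(matrix[-1], matrix[:-1]).ravel()
        return [knowledge_base[i] for i in sims.argsort()[::-1][:k]]

    def query_llm(prompt: str) -> str:
        """Hypothetical stub: wire this to the local model (e.g., qwen2.5:7b)."""
        raise NotImplementedError

    def diagnose(manifestations: str) -> str:
        context = "\n".join(retrieve(manifestations))
        prompt = (
            f"Reference entries:\n{context}\n\n"
            f"Clinical manifestations: {manifestations}\n"
            "Give 1 primary diagnosis and 5 differential diagnoses."
        )
        return query_llm(prompt)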

Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4501/12192912/73cb3b17c69e/jmir-v27-e69929-g001.jpg
