Wang Xinxin, Lin Shuyan, Liu Hui, Li Chuanqing, Zhou Li, Li Rongkang
Department of Obstetrics, Shenzhen Nanshan Maternity and Child Healthcare Hospital, Shenzhen, China.
Department of Urology, South China Hospital, Medical School, Shenzhen University, Shenzhen, China.
Front Public Health. 2026 Feb 4;14:1760871. doi: 10.3389/fpubh.2026.1760871. eCollection 2026.
Gestational diabetes mellitus (GDM) is increasingly prevalent worldwide and is associated with substantial short- and long-term risks for mothers and offspring, making high-quality, accessible health information essential. At the same time, artificial intelligence (AI) chatbots based on large language models are being widely used for health queries, yet their accuracy, reliability and readability in the context of GDM remain unclear.
We first evaluated six AI chatbots (ChatGPT-5, ChatGPT-4o, DeepSeek-V3.2, DeepSeek-R1, Gemini 2.5 Pro and Claude Sonnet 4.5) using 200 single-best-answer multiple-choice questions (MCQs) on GDM drawn from MedQA, MedMCQA and the Chinese National Medical Examination item bank, covering four domains: epidemiology and risk factors, clinical manifestations and diagnosis, maternal and neonatal outcomes, and management and treatment. Each item was posed three times to every model under a standardized prompting protocol, and accuracy was defined as the proportion of correctly answered questions. For public-facing information, we identified 15 core GDM education questions using Google Trends and expert review, and queried four chatbots (ChatGPT-5, DeepSeek-V3.2, Claude Sonnet 4.5 and Gemini 2.5 Pro). Two obstetricians independently assessed reliability using DISCERN, EQIP, GQS and JAMA benchmarks, and readability was quantified using ARI, CL, FKGL, FRES, GFI and SMOG indices.
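The accuracy metric described above (each item posed three times to every model, accuracy defined as the proportion of correct responses) can be sketched as follows; the function and data names are illustrative, not taken from the study's actual pipeline:

```python
from collections import defaultdict

def mcq_accuracy(responses, answer_key):
    """Per-model accuracy: proportion of correct answers over all
    (question, attempt) pairs, mirroring a protocol in which each
    MCQ is posed three times to every model.

    responses:  {(model, question_id, attempt): chosen_option}
    answer_key: {question_id: correct_option}
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for (model, qid, _attempt), choice in responses.items():
        total[model] += 1
        correct[model] += int(choice == answer_key[qid])
    return {m: correct[m] / total[m] for m in total}

# Toy example: one hypothetical model, two questions, three attempts each.
answer_key = {1: "b", 2: "c"}
responses = {
    ("model-x", 1, 1): "b", ("model-x", 1, 2): "b", ("model-x", 1, 3): "a",
    ("model-x", 2, 1): "c", ("model-x", 2, 2): "c", ("model-x", 2, 3): "c",
}
print(mcq_accuracy(responses, answer_key))  # 5 of 6 correct -> about 0.833
```

Aggregating over repeated attempts, rather than a single query, reduces the influence of run-to-run variability in chatbot responses on the reported accuracy.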
Overall MCQ accuracy differed significantly across the six chatbots (p < 0.0001), with ChatGPT-5 achieving the highest mean accuracy (92.17%) and DeepSeek-V3.2 and Gemini 2.5 Pro performing comparably well, while ChatGPT-4o, DeepSeek-R1 and Claude Sonnet 4.5 scored lower. Newer model generations (ChatGPT-5 vs. ChatGPT-4o; DeepSeek-V3.2 vs. DeepSeek-R1) consistently outperformed their predecessors across all four domains. Among the four models evaluated on public-education questions, ChatGPT-5 achieved the highest reliability scores (DISCERN 42.53 ± 7.20; EQIP 71.67 ± 6.17), whereas Claude Sonnet 4.5, DeepSeek-V3.2 and Gemini 2.5 Pro scored lower. JAMA scores were uniformly low (0-0.07/4), reflecting poor transparency. All models produced text above the recommended sixth-grade reading level; ChatGPT-5 showed the most favorable readability profile (for example, FKGL 7.43 ± 2.42, FRES 62.47 ± 13.51) but still did not meet guideline targets.
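Two of the readability indices reported above, FKGL and FRES, follow standard published formulas over word, sentence, and syllable counts. A minimal sketch (the counts used in the example are illustrative inputs, not data from the study):

```python
def fkgl(words, sentences, syllables):
    """Flesch-Kincaid Grade Level: approximate U.S. school grade
    needed to read the text (patient-education target is about 6)."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59

def fres(words, sentences, syllables):
    """Flesch Reading Ease Score: higher is easier; roughly 60-70
    corresponds to plain English."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)

# Illustrative counts: 100 words, 5 sentences, 130 syllables.
print(round(fkgl(100, 5, 130), 1))  # grade level around 7.5
print(round(fres(100, 5, 130), 1))  # reading ease around 76.6
```

Both indices penalize long sentences and polysyllabic words, which is why chatbot outputs written in clinical vocabulary tend to score above the sixth-grade target even when factually accurate.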
Contemporary AI chatbots can generate generally accurate and moderately reliable GDM-related information, with newer model generations showing clear gains in diagnostic validity. However, limited transparency and systematically high reading levels indicate that these tools are not yet suitable as stand-alone resources for GDM patient education and should be used as adjuncts to clinician counseling and professionally curated materials.