

Evaluation of validity, reliability, and readability of AI chatbots for gestational diabetes mellitus: a multi-model comparative study.

Authors

Wang Xinxin, Lin Shuyan, Liu Hui, Li Chuanqing, Zhou Li, Li Rongkang

Affiliations

Department of Obstetrics, Shenzhen Nanshan Maternity and Child Healthcare Hospital, Shenzhen, China.

Department of Urology, South China Hospital, Medical School, Shenzhen University, Shenzhen, China.

Publication

Front Public Health. 2026 Feb 4;14:1760871. doi: 10.3389/fpubh.2026.1760871. eCollection 2026.

DOI: 10.3389/fpubh.2026.1760871
PMID: 41717624
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12913397/
Abstract

BACKGROUND

Gestational diabetes mellitus (GDM) is increasingly prevalent worldwide and is associated with substantial short- and long-term risks for mothers and offspring, making high-quality, accessible health information essential. At the same time, artificial intelligence (AI) chatbots based on large language models are being widely used for health queries, yet their accuracy, reliability and readability in the context of GDM remain unclear.

METHODS

We first evaluated six AI chatbots (ChatGPT-5, ChatGPT-4o, DeepSeek-V3.2, DeepSeek-R1, Gemini 2.5 Pro and Claude Sonnet 4.5) using 200 single-best-answer multiple-choice questions (MCQs) on GDM drawn from MedQA, MedMCQA and the Chinese National Medical Examination item bank, covering four domains: epidemiology and risk factors, clinical manifestations and diagnosis, maternal and neonatal outcomes, and management and treatment. Each item was posed three times to every model under a standardized prompting protocol, and accuracy was defined as the proportion of correctly answered questions. For public-facing information, we identified 15 core GDM education questions using Google Trends and expert review, and queried four chatbots (ChatGPT-5, DeepSeek-V3.2, Claude Sonnet 4.5 and Gemini 2.5 Pro). Two obstetricians independently assessed reliability using DISCERN, EQIP, GQS and JAMA benchmarks, and readability was quantified using ARI, CL, FKGL, FRES, GFI and SMOG indices.
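The readability indices listed above (ARI, CL, FKGL, FRES, GFI, SMOG) are all closed-form functions of word, sentence, and syllable (or character) counts. As an illustrative sketch, the two indices most often quoted for patient materials, FRES and FKGL, can be computed as follows; note the syllable counter here is a naive vowel-group heuristic (dictionary-based tools are more accurate), so treat outputs as approximate.

```python
import re

def count_syllables(word: str) -> int:
    # Naive heuristic: count vowel groups, discount a trailing silent "e".
    # Real readability tools use pronunciation dictionaries instead.
    word = word.lower()
    n = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and n > 1 and not word.endswith(("le", "ee")):
        n -= 1
    return max(n, 1)

def readability(text: str):
    """Return (FRES, FKGL) for a passage of English text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    syllables = sum(count_syllables(w) for w in words)
    wps = len(words) / len(sentences)   # mean words per sentence
    spw = syllables / len(words)        # mean syllables per word
    fres = 206.835 - 1.015 * wps - 84.6 * spw   # Flesch Reading Ease
    fkgl = 0.39 * wps + 11.8 * spw - 15.59      # Flesch-Kincaid Grade Level
    return round(fres, 2), round(fkgl, 2)
```

Higher FRES means easier text; FKGL maps directly to a US school grade, which is how the study's sixth-grade recommendation is expressed.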

RESULTS

Overall MCQ accuracy differed significantly across the six chatbots (p < 0.0001), with ChatGPT-5 achieving the highest mean accuracy (92.17%) and DeepSeek-V3.2 and Gemini 2.5 Pro performing comparably well, while ChatGPT-4o, DeepSeek-R1 and Claude Sonnet 4.5 scored lower. Newer model generations (ChatGPT-5 vs. ChatGPT-4o; DeepSeek-V3.2 vs. DeepSeek-R1) consistently outperformed their predecessors across all four domains. Among the four models evaluated on public-education questions, ChatGPT-5 achieved the highest reliability scores (DISCERN 42.53 ± 7.20; EQIP 71.67 ± 6.17), whereas Claude Sonnet 4.5, DeepSeek-V3.2 and Gemini 2.5 Pro scored lower. JAMA scores were uniformly low (0-0.07/4), reflecting poor transparency. All models produced text above the recommended sixth-grade reading level; ChatGPT-5 showed the most favorable readability profile (for example, FKGL 7.43 ± 2.42, FRES 62.47 ± 13.51) but still did not meet guideline targets.
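Since each item was posed three times per model, the reported mean accuracies summarize repeated runs. A minimal sketch of that aggregation, with hypothetical per-run values (the study does not publish per-run accuracies in the abstract):

```python
from statistics import mean, stdev

def summarize(accs):
    """Mean accuracy (%) and SD (%) over repeated runs of the same item set."""
    return round(mean(accs) * 100, 2), round(stdev(accs) * 100, 2)

# Hypothetical fractions of 200 MCQs answered correctly, three runs per model.
runs = {
    "ChatGPT-5":  [0.925, 0.920, 0.920],
    "ChatGPT-4o": [0.850, 0.845, 0.855],
}

for model, accs in runs.items():
    m, sd = summarize(accs)
    print(f"{model}: {m}% ± {sd}%")
```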

CONCLUSION

Contemporary AI chatbots can generate generally accurate and moderately reliable GDM-related information, with newer model generations showing clear gains in diagnostic validity. However, limited transparency and systematically high reading levels indicate that these tools are not yet suitable as stand-alone resources for GDM patient education and should be used as adjuncts to clinician counseling and professionally curated materials.


Similar Articles

1
Evaluation of validity, reliability, and readability of AI chatbots for gestational diabetes mellitus: a multi-model comparative study.
Front Public Health. 2026 Feb 4;14:1760871. doi: 10.3389/fpubh.2026.1760871. eCollection 2026.
2
Evaluating the Readability and Quality of Bladder Cancer Information from AI Chatbots: A Comparative Study Between ChatGPT, Google Gemini, Grok, Claude and DeepSeek.
J Clin Med. 2025 Nov 3;14(21):7804. doi: 10.3390/jcm14217804.
3
Evaluating the use of advanced large language models to improve readability of head and neck cancer patient education materials.
Am J Otolaryngol. 2025 Nov-Dec;46(6):104744. doi: 10.1016/j.amjoto.2025.104744. Epub 2025 Oct 20.
4
AI-generated patient education for ankylosing spondylitis: a comparative study of readability and quality.
Clin Rheumatol. 2026 Mar;45(3):2003-2008. doi: 10.1007/s10067-025-07771-8. Epub 2025 Dec 13.
5
Balancing Accuracy and Readability: Comparative Evaluation of AI Chatbots for Patient Education on Rotator Cuff Tears.
Healthcare (Basel). 2025 Oct 23;13(21):2670. doi: 10.3390/healthcare13212670.
6
AI Chatbots in Answering Questions Related to Ocular Oncology: A Comparative Study Between DeepSeek v3, ChatGPT-4o, and Gemini 2.0.
Cureus. 2025 Aug 22;17(8):e90773. doi: 10.7759/cureus.90773. eCollection 2025 Aug.
7
Evaluating AI-generated patient education materials for spinal surgeries: Comparative analysis of readability and DISCERN quality across ChatGPT and deepseek models.
Int J Med Inform. 2025 Jun;198:105871. doi: 10.1016/j.ijmedinf.2025.105871. Epub 2025 Mar 13.
8
Multicriteria Assessment of Text Quality in Large Language Model-Generated Gynecomastia Materials: DeepSeek Versus OpenAI Versus Claude.
J Craniofac Surg. 2025 Sep 10. doi: 10.1097/SCS.0000000000011930.
9
Examining Artificial Intelligence Chatbots' Responses in Providing Human Papillomavirus Vaccine Information for Young Adults: Qualitative Content Analysis.
JMIR Public Health Surveill. 2026 Feb 18;12:e79720. doi: 10.2196/79720.
10
Assessing the Readability of Patient Education Materials on Cardiac Catheterization From Artificial Intelligence Chatbots: An Observational Cross-Sectional Study.
Cureus. 2024 Jul 4;16(7):e63865. doi: 10.7759/cureus.63865. eCollection 2024 Jul.