Wang Xinxin, Lin Shuyan, Liu Hui, Li Chuanqing, Zhou Li, Li Rongkang
Department of Obstetrics, Shenzhen Nanshan Maternity and Child Healthcare Hospital, Shenzhen, China.
Department of Urology, South China Hospital, Medical School, Shenzhen University, Shenzhen, China.
Front Public Health. 2026 Feb 4;14:1760871. doi: 10.3389/fpubh.2026.1760871. eCollection 2026.
Gestational diabetes mellitus (GDM) is increasingly prevalent worldwide and is associated with substantial short- and long-term risks for mothers and offspring, making high-quality, accessible health information essential. At the same time, artificial intelligence (AI) chatbots based on large language models are being widely used for health queries, yet their accuracy, reliability and readability in the context of GDM remain unclear.
We first evaluated six AI chatbots (ChatGPT-5, ChatGPT-4o, DeepSeek-V3.2, DeepSeek-R1, Gemini 2.5 Pro and Claude Sonnet 4.5) using 200 single-best-answer multiple-choice questions (MCQs) on GDM drawn from MedQA, MedMCQA and the Chinese National Medical Examination item bank, covering four domains: epidemiology and risk factors, clinical manifestations and diagnosis, maternal and neonatal outcomes, and management and treatment. Each item was posed three times to every model under a standardized prompting protocol, and accuracy was defined as the proportion of correctly answered questions. For public-facing information, we identified 15 core GDM education questions using Google Trends and expert review, and queried four chatbots (ChatGPT-5, DeepSeek-V3.2, Claude Sonnet 4.5 and Gemini 2.5 Pro). Two obstetricians independently assessed reliability using DISCERN, EQIP, GQS and JAMA benchmarks, and readability was quantified using ARI, CL, FKGL, FRES, GFI and SMOG indices.
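The accuracy metric described above (each item posed three times to every model, accuracy defined as the proportion of correct responses) can be sketched as follows; the function and data names are illustrative, not taken from the study's actual pipeline:

```python
from collections import defaultdict

def mcq_accuracy(responses, answer_key):
    """Per-model accuracy: proportion of correct answers over all
    (question, attempt) pairs, mirroring a protocol in which each
    MCQ is posed three times to every model.

    responses:  {(model, question_id, attempt): chosen_option}
    answer_key: {question_id: correct_option}
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for (model, qid, _attempt), choice in responses.items():
        total[model] += 1
        correct[model] += int(choice == answer_key[qid])
    return {m: correct[m] / total[m] for m in total}

# Toy example: one hypothetical model, two questions, three attempts each.
answer_key = {1: "b", 2: "c"}
responses = {
    ("model-x", 1, 1): "b", ("model-x", 1, 2): "b", ("model-x", 1, 3): "a",
    ("model-x", 2, 1): "c", ("model-x", 2, 2): "c", ("model-x", 2, 3): "c",
}
print(mcq_accuracy(responses, answer_key))  # 5 of 6 correct -> about 0.833
```

Aggregating over repeated attempts, rather than a single query, reduces the influence of run-to-run variability in chatbot responses on the reported accuracy.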
Overall MCQ accuracy differed significantly across the six chatbots (p < 0.0001), with ChatGPT-5 achieving the highest mean accuracy (92.17%) and DeepSeek-V3.2 and Gemini 2.5 Pro performing comparably well, while ChatGPT-4o, DeepSeek-R1 and Claude Sonnet 4.5 scored lower. Newer model generations (ChatGPT-5 vs. ChatGPT-4o; DeepSeek-V3.2 vs. DeepSeek-R1) consistently outperformed their predecessors across all four domains. Among the four models evaluated on public-education questions, ChatGPT-5 achieved the highest reliability scores (DISCERN 42.53 ± 7.20; EQIP 71.67 ± 6.17), whereas Claude Sonnet 4.5, DeepSeek-V3.2 and Gemini 2.5 Pro scored lower. JAMA scores were uniformly low (0-0.07/4), reflecting poor transparency. All models produced text above the recommended sixth-grade reading level; ChatGPT-5 showed the most favorable readability profile (for example, FKGL 7.43 ± 2.42, FRES 62.47 ± 13.51) but still did not meet guideline targets.
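Two of the readability indices reported above, FKGL and FRES, follow standard published formulas over word, sentence, and syllable counts. A minimal sketch (the counts used in the example are illustrative inputs, not data from the study):

```python
def fkgl(words, sentences, syllables):
    """Flesch-Kincaid Grade Level: approximate U.S. school grade
    needed to read the text (patient-education target is about 6)."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59

def fres(words, sentences, syllables):
    """Flesch Reading Ease Score: higher is easier; roughly 60-70
    corresponds to plain English."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)

# Illustrative counts: 100 words, 5 sentences, 130 syllables.
print(round(fkgl(100, 5, 130), 1))  # grade level around 7.5
print(round(fres(100, 5, 130), 1))  # reading ease around 76.6
```

Both indices penalize long sentences and polysyllabic words, which is why chatbot outputs written in clinical vocabulary tend to score above the sixth-grade target even when factually accurate.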
Contemporary AI chatbots can generate generally accurate and moderately reliable GDM-related information, with newer model generations showing clear gains in diagnostic validity. However, limited transparency and systematically high reading levels indicate that these tools are not yet suitable as stand-alone resources for GDM patient education and should be used as adjuncts to clinician counseling and professionally curated materials.