Comparative evaluation of ChatGPT-4, ChatGPT-3.5 and Google Gemini on PCOS assessment and management based on recommendations from the 2023 guideline.

Authors

Gunesli Irmak, Aksun Seren, Fathelbab Jana, Yildiz Bulent Okan

Affiliations

Hacettepe University School of Medicine, Department of Internal Medicine, Ankara, Turkey.

Hacettepe University School of Medicine, Division of Endocrinology and Metabolism, Ankara, Turkey.

Publication information

Endocrine. 2025 Apr;88(1):315-322. doi: 10.1007/s12020-024-04121-7. Epub 2024 Dec 2.

DOI:10.1007/s12020-024-04121-7

PMID:39623241

Abstract

CONTEXT

Artificial intelligence (AI) is increasingly utilized in healthcare, with models like ChatGPT and Google Gemini gaining global popularity. Polycystic ovary syndrome (PCOS) is a prevalent condition that requires both lifestyle modifications and medical treatment, highlighting the critical need for effective patient education. This study compares the responses of ChatGPT-4, ChatGPT-3.5 and Gemini to PCOS-related questions using the latest guideline. Evaluating AI's integration into patient education necessitates assessing response quality, reliability, readability and effectiveness in managing PCOS.

PURPOSE

To evaluate the accuracy, quality, readability and tendency to hallucinate of ChatGPT-4, ChatGPT-3.5 and Gemini's responses to questions about PCOS, its assessment and management based on recommendations from the current international PCOS guideline.

METHODS

This cross-sectional study assessed ChatGPT-4, ChatGPT-3.5, and Gemini's responses to PCOS-related questions created by endocrinologists using the latest guidelines and common patient queries. Experts evaluated the responses for accuracy, quality and tendency to hallucinate using Likert scales, while readability was analyzed using standard formulas.
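The abstract does not say which readability formulas were used. As an illustration only, the sketch below assumes two widely used ones, the Flesch Reading Ease score and the Flesch-Kincaid Grade Level, and computes them for an English text. The function names and the vowel-run syllable heuristic are hypothetical choices for this example, not the authors' method.

```python
import re


def count_syllables(word: str) -> int:
    """Rough heuristic (an assumption): one syllable per run of vowels, minimum one."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def readability(text: str) -> dict:
    """Return Flesch Reading Ease and Flesch-Kincaid Grade Level for English text.

    Sentences are split on '.', '!' and '?', so decimals such as 'p = 0.001'
    would be miscounted; strip such tokens first for real use.
    """
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")

    syllables = sum(count_syllables(w) for w in words)
    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = syllables / len(words)

    return {
        # Higher score means easier to read.
        "flesch_reading_ease": 206.835
        - 1.015 * words_per_sentence
        - 84.6 * syllables_per_word,
        # Approximate US school grade needed to understand the text.
        "flesch_kincaid_grade": 0.39 * words_per_sentence
        + 11.8 * syllables_per_word
        - 15.59,
    }


if __name__ == "__main__":
    simple = "The cat sat. The dog ran."
    dense = "Polycystic ovary syndrome requires individualized pharmacological management."
    print(readability(simple))
    print(readability(dense))
```

Under these formulas, shorter sentences and shorter words produce a higher Reading Ease score and a lower grade level, which is the sense in which one chatbot's responses can be more readable than another's.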

RESULTS

ChatGPT-4 and ChatGPT-3.5 attained higher accuracy and quality scores than Gemini (p = 0.001, p < 0.001 and p = 0.007, p < 0.001 respectively). However, Gemini obtained a higher readability score than the other chatbots (p < 0.001). The tendency-to-hallucinate scores also differed significantly, a difference attributable to Gemini's lower scores (p = 0.003).

CONCLUSION

The high accuracy and quality of the responses provided by ChatGPT-4 and ChatGPT-3.5 to questions about PCOS suggest that they could be supportive in clinical practice. Future technological advancements may facilitate the use of artificial intelligence both in educating patients with PCOS and in supporting the management of the disorder.
