Bragazzi Nicola Luigi, Buchinger Michèle, Atwan Hisham, Tuma Ruba, Chirico Francesco, Szarpak Lukasz, Farah Raymond, Khamisy-Farah Rola
Laboratory for Industrial and Applied Mathematics, Department of Mathematics and Statistics, York University, Toronto, ON, Canada.
Department of Computer Science, Data Science, and Information Technology, Faculty of Natural and Applied Sciences, Sol Plaatje University, Kimberley, South Africa.
JMIR Form Res. 2025 Feb 5;9:e56126. doi: 10.2196/56126.
The COVID-19 pandemic has significantly strained health care systems globally, leading to an overwhelming influx of patients and exacerbating resource limitations. Concurrently, an "infodemic" of misinformation, particularly prevalent in women's health, has emerged. This challenge has been pivotal for health care providers, especially gynecologists and obstetricians, in managing pregnant women's health. The pandemic heightened COVID-19-related risks for pregnant women, requiring specialists to offer balanced advice weighing vaccine safety against known risks. In addition, the advent of generative artificial intelligence (AI), such as large language models (LLMs), offers promising support in health care; however, such tools require rigorous testing.
This study aimed to assess LLMs' proficiency, clarity, and objectivity regarding COVID-19's impacts on pregnancy.
This study evaluated 4 major AI prototypes (ChatGPT-3.5, ChatGPT-4, Microsoft Copilot, and Google Bard) using zero-shot prompts from a questionnaire validated among 159 Israeli gynecologists and obstetricians. The questionnaire assessed proficiency in providing accurate information on COVID-19 in relation to pregnancy. Text mining, sentiment analysis, and readability analyses (Flesch-Kincaid Grade Level and Flesch Reading Ease Score) were also conducted.
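The two readability metrics used here are standard closed-form formulas over word, sentence, and syllable counts. As a minimal sketch (not the authors' actual analysis pipeline, which is not specified in the abstract), they can be computed as follows; the `count_syllables` heuristic is a rough illustrative approximation, not a linguistically exact counter:

```python
import re

def flesch_reading_ease(words: int, sentences: int, syllables: int) -> float:
    """Flesch Reading Ease Score: higher values mean easier text (roughly 0-100)."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)

def flesch_kincaid_grade(words: int, sentences: int, syllables: int) -> float:
    """Flesch-Kincaid Grade Level: approximate US school grade needed to read the text."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59

def count_syllables(word: str) -> int:
    """Very rough English syllable estimate: count vowel groups, with a
    minimum of one syllable per word. Real analyses use dictionary-based counters."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))

def score_text(text: str) -> tuple[float, float]:
    """Return (Reading Ease, Grade Level) for a plain-text response."""
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words_list = re.findall(r"[A-Za-z']+", text)
    words = max(1, len(words_list))
    syllables = sum(count_syllables(w) for w in words_list)
    return (flesch_reading_ease(words, sentences, syllables),
            flesch_kincaid_grade(words, sentences, syllables))
```

For example, a passage with 100 words, 5 sentences, and 150 syllables scores a Reading Ease of about 59.6 and a Grade Level of about 9.9, i.e., comparable to the Microsoft Copilot responses reported below.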
In terms of LLMs' knowledge, ChatGPT-4 and Microsoft Copilot each scored 97% (32/33), Google Bard 94% (31/33), and ChatGPT-3.5 82% (27/33). ChatGPT-4 incorrectly stated an increased risk of miscarriage due to COVID-19. Google Bard and Microsoft Copilot had minor inaccuracies concerning COVID-19 transmission and complications. In the sentiment analysis, Microsoft Copilot achieved the least negative score (−4), followed by ChatGPT-4 (−6) and Google Bard (−7), while ChatGPT-3.5 obtained the most negative score (−12). Finally, in the readability analysis, the Flesch-Kincaid Grade Level and Flesch Reading Ease Score showed that Microsoft Copilot was the most accessible at 9.9 and 49, followed by ChatGPT-4 at 12.4 and 37.1, while ChatGPT-3.5 (12.9 and 35.6) and Google Bard (12.9 and 35.8) generated particularly complex responses.
The study highlights varying knowledge levels of LLMs in relation to COVID-19 and pregnancy. ChatGPT-3.5 showed the least knowledge and alignment with scientific evidence. Readability and complexity analyses suggest that each AI's approach was tailored to specific audiences, with ChatGPT versions being more suitable for specialized readers and Microsoft Copilot for the general public. Sentiment analysis revealed notable variations in the way LLMs communicated critical information, underscoring the essential role of neutral and objective health care communication in ensuring that pregnant women, particularly vulnerable during the COVID-19 pandemic, receive accurate and reassuring guidance. Overall, ChatGPT-4, Microsoft Copilot, and Google Bard generally provided accurate, updated information on COVID-19 and vaccines in maternal and fetal health, aligning with health guidelines. The study demonstrated the potential role of AI in supplementing health care knowledge, with a need for continuous updating and verification of AI knowledge bases. The choice of AI tool should consider the target audience and required information detail level.