de Vries P L M, Baud D, Baggio S, Ceulemans M, Favre G, Gerbier E, Legardeur H, Maisonneuve E, Pena-Reyes C, Pomar L, Winterfeld U, Panchaud A
Department of Gynecology and Obstetrics, Lausanne University Hospital and University of Lausanne, Lausanne, Switzerland.
Institute of Primary Health Care (BIHAM), University of Bern, Bern, Switzerland.
PEC Innov. 2025 Feb 10;6:100381. doi: 10.1016/j.pecinn.2025.100381. eCollection 2025 Jun.
To evaluate ChatGPT's accuracy as an information source for women and maternity-care workers on "nutrition" and "red flags" in pregnancy.
Eight raters assessed the accuracy of ChatGPT-generated recommendations on a 5-point Likert scale, for ten indicators per topic, in four languages (French, English, German and Dutch). Accuracy and inter-rater agreement were calculated per topic and language.
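As a rough illustration of how such Likert ratings might be summarized, the following minimal sketch computes a median, an interquartile range and a percentage score. The function name `summarize_ratings`, the example ratings, and the definition of the percentage (sum of ratings divided by the maximum attainable sum) are assumptions made for illustration; the abstract does not state how the overall accuracy score or the inter-rater agreement statistic was computed.

```python
from statistics import median, quantiles


def summarize_ratings(ratings, scale_max=5):
    """Summarize a list of Likert ratings (integers from 1 to scale_max).

    Assumption: the percentage score is the sum of ratings divided by the
    maximum attainable sum. The study may have used a different definition.
    """
    # Quartile cut points; "inclusive" treats the data as the full population.
    q1, _, q3 = quantiles(ratings, n=4, method="inclusive")
    return {
        "median": median(ratings),
        "iqr": (q1, q3),
        "percent_of_max": 100 * sum(ratings) / (scale_max * len(ratings)),
    }


# Hypothetical ratings from eight raters for a single indicator.
example = [5, 5, 4, 5, 3, 5, 4, 5]
summary = summarize_ratings(example)
print(summary)
```

The same function could be applied per topic and per language by grouping the ratings before calling it.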
For both topics, median accuracy scores of ChatGPT-generated recommendations were excellent (5.0; IQR 4-5), independently of language. Median accuracy scores varied by at most 1 point on the 5-point Likert scale depending on how the question was framed. Overall accuracy scores were 83-89% for 'nutrition in pregnancy' versus 96-98% for 'red flags in pregnancy'. Inter-rater agreement was good to excellent for both topics.
Although ChatGPT generated accurate recommendations for the tested indicators on nutrition and red flags during pregnancy, women should be aware of its limitations, such as inconsistencies that depend on question wording, language and the woman's personal context.
Despite growing interest in the potential use of artificial intelligence in healthcare, this is, to the best of our knowledge, the first study to assess factors, such as language and question framing, that may limit the accuracy of ChatGPT-generated recommendations in key domains of perinatal health.