Khromchenko Keren, Shaikh Sameeha, Singh Meghana, Vurture Gregory, Rana Rima A, Baum Jonathan D
Obstetrics and Gynecology, Hackensack Meridian Jersey Shore University Medical Center, Neptune, USA.
Obstetrics and Gynecology, Hackensack Meridian School of Medicine, Nutley, USA.
Cureus. 2024 Jul 27;16(7):e65543. doi: 10.7759/cureus.65543. eCollection 2024 Jul.
Large language models (LLMs) are widely used to provide information in many fields, including obstetrics and gynecology. Which model performs best in answering commonly asked pregnancy questions is unknown. A qualitative analysis of Chat Generative Pre-Training Transformer Version 3.5 (ChatGPT-3.5) (OpenAI, Inc., San Francisco, California, United States) and Bard, recently renamed Google Gemini (Google LLC, Mountain View, California, United States), was performed in August 2023. Each LLM was queried with 12 commonly asked pregnancy questions and asked to provide its references. The co-authors reviewed and graded the responses and references for both LLMs individually and then as a group to reach a consensus. Responses were graded as "acceptable" or "not acceptable" based on correctness and completeness relative to American College of Obstetricians and Gynecologists (ACOG) publications, PubMed-indexed evidence, and clinical experience. References were classified as "verified," "broken," "irrelevant," "non-existent," or "no references." Grades of "acceptable" were given to 58% of ChatGPT-3.5 responses (seven out of 12) and 83% of Bard responses (10 out of 12). ChatGPT-3.5 had issues with 100% of its references, whereas Bard had discrepancies in 8% of its references (one out of 12). Comparing ChatGPT-3.5 responses between May 2023 and August 2023, the proportion of "acceptable" responses changed from 50% to 58%. Bard answered more questions correctly than ChatGPT-3.5 on this small sample of commonly asked pregnancy questions. ChatGPT-3.5 performed poorly in reference verification. Its overall performance remained stable over time, with approximately one-half of responses graded "acceptable" in both May and August 2023.
Both LLMs need further evaluation and vetting before being accepted as accurate and reliable sources of information for pregnant women.