Khromchenko Keren, Shaikh Sameeha, Singh Meghana, Vurture Gregory, Rana Rima A, Baum Jonathan D
Obstetrics and Gynecology, Hackensack Meridian Jersey Shore University Medical Center, Neptune, USA.
Obstetrics and Gynecology, Hackensack Meridian School of Medicine, Nutley, USA.
Cureus. 2024 Jul 27;16(7):e65543. doi: 10.7759/cureus.65543. eCollection 2024 Jul.
Large language models (LLMs) are widely used to provide information in many fields, including obstetrics and gynecology. Which model performs best in answering commonly asked pregnancy questions is unknown. A qualitative analysis of Chat Generative Pre-Training Transformer Version 3.5 (ChatGPT-3.5) (OpenAI, Inc., San Francisco, California, United States) and Bard, recently renamed Google Gemini (Google LLC, Mountain View, California, United States), was performed in August 2023. Each LLM was queried with 12 commonly asked pregnancy questions and asked to provide its references. The co-authors reviewed and graded the responses and references for both LLMs individually and then as a group to reach a consensus. Responses were graded as "acceptable" or "not acceptable" based on correctness and completeness relative to American College of Obstetricians and Gynecologists (ACOG) publications, PubMed-indexed evidence, and clinical experience. References were classified as "verified," "broken," "irrelevant," "non-existent," or "no references." Grades of "acceptable" were given to 58% of ChatGPT-3.5 responses (seven out of 12) and 83% of Bard responses (10 out of 12). ChatGPT-3.5 had issues with 100% of its references, whereas Bard had discrepancies in 8% of its references (one out of 12). Comparing ChatGPT-3.5 responses between May 2023 and August 2023, the proportion of "acceptable" responses changed from 50% to 58%. Bard answered more questions correctly than ChatGPT-3.5 on this small sample of commonly asked pregnancy questions. ChatGPT-3.5 performed poorly in reference verification. Its overall performance remained stable over time, with approximately one-half of responses graded "acceptable" in both May and August 2023.
Both LLMs need further evaluation and vetting before being accepted as accurate and reliable sources of information for pregnant women.