
The promise and peril of using a large language model to obtain clinical information: ChatGPT performs strongly as a fertility counseling tool with limitations.

Affiliations

Albert Einstein College of Medicine/Montefiore's Institute for Reproductive Medicine and Health, Hartsdale, New York.


Publication information

Fertil Steril. 2023 Sep;120(3 Pt 2):575-583. doi: 10.1016/j.fertnstert.2023.05.151. Epub 2023 May 20.

Abstract

OBJECTIVE

To compare the responses of the large language model-based "ChatGPT" to reputable sources when given fertility-related clinical prompts.

DESIGN

The "Feb 13" version of ChatGPT by OpenAI was tested against established sources of patient-oriented clinical information: 17 "frequently asked questions (FAQs)" about infertility on the Centers for Disease Control and Prevention (CDC) website; 2 validated fertility knowledge surveys (the Cardiff Fertility Knowledge Scale and the Fertility and Infertility Treatment Knowledge Score); and the American Society for Reproductive Medicine committee opinion "Optimizing natural fertility."

SETTING

Academic medical center.

PATIENT(S): Online AI Chatbot.

INTERVENTION(S): Frequently asked questions, survey questions and rephrased summary statements were entered as prompts in the chatbot over a 1-week period in February 2023.

MAIN OUTCOME MEASURE(S): For FAQs from the CDC: words/response, sentiment analysis polarity and objectivity, total factual statements, and rate of statements that were incorrect, referenced a source, or noted the value of consulting providers. For fertility knowledge surveys: percentile according to published population data. For committee opinion: whether response to conclusions rephrased as questions identified missing facts.

RESULT(S): When administered the CDC's 17 infertility FAQs, ChatGPT produced responses of similar length (207.8 ChatGPT vs. 181.0 CDC words/response), factual content (8.65 vs. 10.41 factual statements/response), sentiment polarity (mean 0.11 vs. 0.11 on a scale of -1 (negative) to 1 (positive)), and subjectivity (mean 0.42 vs. 0.35 on a scale of 0 (objective) to 1 (subjective)). In total, 9 (6.12%) of 147 ChatGPT factual statements were categorized as incorrect, and only 1 (0.68%) statement cited a reference. ChatGPT would have been at the 87th percentile of Bunting's 2013 international cohort for the Cardiff Fertility Knowledge Scale and at the 95th percentile on the basis of Kudesia's 2017 cohort for the Fertility and Infertility Treatment Knowledge Score. ChatGPT reproduced the missing facts for all 7 summary statements from "Optimizing natural fertility."

CONCLUSION(S): A February 2023 version of "ChatGPT" demonstrates the ability of generative artificial intelligence to produce relevant, meaningful responses to fertility-related clinical queries comparable to established sources. Although performance may improve with medical domain-specific training, limitations such as the inability to reliably cite sources and the unpredictable possibility of fabricated information may limit its clinical use.
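The statement-level rates reported in the results can be reproduced from the counts given in the abstract. The following is a minimal sketch of that arithmetic only; the helper name `rate_percent` and the variable names are illustrative assumptions and are not taken from the study's methods.

```python
# Consistency check on the counts reported in the abstract.
# All names here are hypothetical; only the numbers come from the abstract.

def rate_percent(count: int, total: int) -> float:
    """Return count/total as a percentage rounded to two decimals."""
    if total <= 0:
        raise ValueError("total must be positive")
    return round(100 * count / total, 2)

FAQ_COUNT = 17            # CDC infertility FAQs entered as prompts
TOTAL_STATEMENTS = 147    # factual statements in ChatGPT's responses
INCORRECT = 9             # statements categorized as incorrect
CITED = 1                 # statements that cited a reference

incorrect_rate = rate_percent(INCORRECT, TOTAL_STATEMENTS)
cited_rate = rate_percent(CITED, TOTAL_STATEMENTS)
mean_statements = round(TOTAL_STATEMENTS / FAQ_COUNT, 2)

print(f"Incorrect statements: {incorrect_rate}%")
print(f"Statements citing a reference: {cited_rate}%")
print(f"Mean factual statements per response: {mean_statements}")
```

The computed values match the reported 6.12%, 0.68%, and 8.65 factual statements per response, so the three figures are mutually consistent with a total of 147 statements across 17 responses.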

