From the Bascom Palmer Eye Institute, Miami, Florida, USA (L.Z.C., A.S., J.S.Y., N.Y., C.A.).
Am J Ophthalmol. 2023 Oct;254:141-149. doi: 10.1016/j.ajo.2023.05.024. Epub 2023 Jun 18.
To investigate the ability of generative artificial intelligence models to answer ophthalmology board-style questions.
Experimental study.
This study evaluated 3 large language models (LLMs) with chat interfaces, Bing Chat (Microsoft), ChatGPT-3.5, and ChatGPT-4.0 (OpenAI), using 250 questions from the Basic Science and Clinical Science Self-Assessment Program. Whereas ChatGPT is trained on information last updated in 2021, Bing Chat incorporates a more recently indexed internet search to generate its answers. Performance was compared with that of human respondents. Questions were categorized by complexity and patient care phase, and instances of information fabrication or nonlogical reasoning were documented.
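As a minimal sketch of how board-style multiple-choice responses from a chat model might be scored against an answer key (this is an illustrative assumption, not the authors' actual pipeline; the question identifiers, model outputs, and answer-extraction rule below are hypothetical):

```python
# Sketch only: scoring free-text chat replies to multiple-choice questions.
# Assumes the model's chosen option appears as a standalone letter A-E.
import re

def extract_choice(response_text: str) -> str | None:
    """Pull the first standalone answer letter (A-E) from a free-text reply."""
    match = re.search(r"\b([A-E])\b", response_text)
    return match.group(1) if match else None

def score_responses(responses: dict[str, str], answer_key: dict[str, str]) -> float:
    """Fraction of questions answered correctly."""
    correct = sum(
        extract_choice(text) == answer_key[qid]
        for qid, text in responses.items()
    )
    return correct / len(answer_key)

# Made-up example data: two questions, one answered correctly.
answer_key = {"q1": "B", "q2": "D"}
responses = {
    "q1": "The best answer is B because the lesion is choroidal.",
    "q2": "I would choose A.",
}
print(f"Accuracy: {score_responses(responses, answer_key):.1%}")  # 50.0%
```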
The primary outcome was response accuracy. Secondary outcomes were performance across question subcategories and the frequency of hallucinations.
Human respondents had an average accuracy of 72.2%. ChatGPT-3.5 scored the lowest (58.8%), whereas ChatGPT-4.0 (71.6%) and Bing Chat (71.2%) performed comparably. ChatGPT-4.0 excelled at workup-type questions (odds ratio [OR], 3.89, 95% CI, 1.19-14.73, P = .03) compared with diagnostic questions, but struggled with image interpretation (OR, 0.14, 95% CI, 0.05-0.33, P < .01) when compared with single-step reasoning questions. Relative to single-step questions, Bing Chat also had difficulty with image interpretation (OR, 0.18, 95% CI, 0.08-0.44, P < .01) and multi-step reasoning (OR, 0.30, 95% CI, 0.11-0.84, P = .02). ChatGPT-3.5 had the highest rate of hallucinations and nonlogical reasoning (42.4%), followed by Bing Chat (25.6%) and ChatGPT-4.0 (18.0%).
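For readers unfamiliar with how an odds ratio and its 95% CI relate to counts of correct and incorrect answers in two question categories, the sketch below shows the standard Wald calculation from a 2x2 table. The counts are made up for illustration; the study's exact statistical model may differ (e.g., a regression adjusting for other factors).

```python
# Sketch: odds ratio with a Wald 95% CI from a 2x2 table of
# correct/incorrect answers in two question categories. Counts are hypothetical.
import math

def odds_ratio_ci(a: int, b: int, c: int, d: int, z: float = 1.96):
    """OR and 95% CI for the table [[a, b], [c, d]]:
    rows = question category, columns = correct vs. incorrect."""
    or_ = (a * d) / (b * c)
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(or_) - z * se_log_or)
    upper = math.exp(math.log(or_) + z * se_log_or)
    return or_, lower, upper

# Hypothetical counts: 18/20 correct in one category vs. 30/43 in another.
print(odds_ratio_ci(18, 2, 30, 13))  # OR ~3.9 with a wide CI
```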
LLMs (particularly ChatGPT-4.0 and Bing Chat) can perform comparably to human respondents when answering questions from the Basic Science and Clinical Science Self-Assessment Program. The frequency of hallucinations and nonlogical reasoning suggests room for improvement in the performance of conversational agents in the medical domain.