ChatGPT-3.5 and Bing Chat in ophthalmology: an updated evaluation of performance, readability, and informative sources.

Author Information

Tao Brendan Ka-Lok, Hua Nicholas, Milkovich John, Micieli Jonathan Andrew

Affiliations

Faculty of Medicine, The University of British Columbia, 317-2194 Health Sciences Mall, Vancouver, BC, V6T 1Z3, Canada.

Temerty Faculty of Medicine, University of Toronto, 1 King's College Circle, Toronto, ON, M5S 1A8, Canada.

Publication Information

Eye (Lond). 2024 Jul;38(10):1897-1902. doi: 10.1038/s41433-024-03037-w. Epub 2024 Mar 20.

Abstract

BACKGROUND/OBJECTIVES: Experimental investigation. The integration of ChatGPT-4 (OpenAI) into Bing Chat (Microsoft) confers the capability to access online data past 2021. We investigate Bing Chat's performance against ChatGPT-3.5 on a multiple-choice ophthalmology examination.

SUBJECTS/METHODS: In August 2023, ChatGPT-3.5 and Bing Chat were evaluated against 913 questions derived from the American Academy of Ophthalmology's Basic and Clinical Science Course (BCSC) collection. For each response, the sub-topic, performance, Simple Measure of Gobbledygook (SMOG) readability score (an estimate of the years of education required to understand a given passage), and cited resources were collected. The primary outcomes were the comparative scores between models and, qualitatively, the resources referenced by Bing Chat. Secondary outcomes included performance stratified by response readability, question type (explicit or situational), and BCSC sub-topic.
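For readers unfamiliar with the readability metric, the SMOG grade is derived from sentence and polysyllable counts. The short Python sketch below is illustrative only: it uses the standard published SMOG formula but a crude vowel-group heuristic for syllable counting, not whatever tooling the authors actually used.

import math
import re

def count_syllables(word):
    # Crude heuristic: each run of consecutive vowels counts as one syllable.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def smog_grade(text):
    # SMOG grade = 1.0430 * sqrt(polysyllables * 30 / sentences) + 3.1291,
    # i.e. the estimated years of education needed to understand the passage.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    polysyllables = sum(1 for w in words if count_syllables(w) >= 3)
    return 1.0430 * math.sqrt(polysyllables * 30 / len(sentences)) + 3.1291

print(round(smog_grade("Glaucoma is a progressive optic neuropathy. "
                       "Intraocular pressure is the main modifiable risk factor."), 1))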

RESULTS

Across 913 questions, ChatGPT-3.5 scored 59.69% [95% CI 56.45, 62.94] while Bing Chat scored 73.60% [95% CI 70.69, 76.52]. Both models performed significantly better on explicit questions than on clinical-reasoning questions, and both performed better on general medicine questions than on the ophthalmology sub-sections. Bing Chat referenced 927 online entities and provided at least one citation for 836 of the 913 questions. Use of more reliable (peer-reviewed) sources was associated with a higher likelihood of a correct response. The most-cited resources were eyewiki.aao.org, aao.org, wikipedia.org, and ncbi.nlm.nih.gov. Bing Chat responses showed significantly better readability than those of ChatGPT-3.5, averaging a reading level of grade 11.4 [95% CI 7.14, 15.7] versus grade 12.4 [95% CI 8.77, 16.1], respectively (p < 0.0001, ρ = 0.25).
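As a rough sanity check, the headline percentages correspond to about 545/913 and 672/913 correct answers. The sketch below is illustrative only: the counts are back-calculated from the reported percentages and a simple Wald (normal-approximation) interval is assumed, so it reproduces the reported confidence intervals only approximately, since the abstract does not state the exact interval method used.

import math

def score_with_ci(correct, total, z=1.96):
    # Proportion correct with a Wald (normal-approximation) 95% CI, in percent.
    p = correct / total
    half = z * math.sqrt(p * (1 - p) / total)
    return 100 * p, 100 * (p - half), 100 * (p + half)

# Correct-answer counts back-calculated from the reported 59.69% and 73.60% of 913.
for model, correct in [("ChatGPT-3.5", 545), ("Bing Chat", 672)]:
    score, lo, hi = score_with_ci(correct, 913)
    print(f"{model}: {score:.2f}% [95% CI {lo:.2f}, {hi:.2f}]")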

CONCLUSIONS

The online access, improved readability, and citation feature of Bing Chat confer additional utility for ophthalmology learners. We recommend critical appraisal of cited sources when interpreting responses.
