
The performance of ChatGPT-4 and Bing Chat in frequently asked questions about glaucoma.

Author Information

Doğan Levent, Yılmaz İbrahim Edhem

Affiliation

Department of Ophthalmology, Kilis State Hospital, Kilis, Turkey.

Publication Information

Eur J Ophthalmol. 2025 Jul;35(4):1323-1328. doi: 10.1177/11206721251321197. Epub 2025 Feb 19.

Abstract

Purpose: To evaluate the appropriateness and readability of the responses generated by ChatGPT-4 and Bing Chat to frequently asked questions about glaucoma.

Methods: Thirty-four questions were generated for this study. Each question was directed three times to a fresh ChatGPT-4 or Bing Chat interface. The obtained responses were categorised by two glaucoma specialists in terms of their appropriateness. Accuracy of the responses was evaluated using the Structure of the Observed Learning Outcome (SOLO) taxonomy. Readability of the responses was assessed using the Flesch Reading Ease (FRE), Flesch-Kincaid Grade Level (FKGL), Coleman-Liau Index (CLI), Simple Measure of Gobbledygook (SMOG), and Gunning-Fog Index (GFI).

Results: The percentage of appropriate responses was 88.2% (30/34) for ChatGPT-4 and 79.2% (27/34) for Bing Chat. Both the ChatGPT-4 and Bing Chat interfaces provided at least one inappropriate response to 1 of the 34 questions. The SOLO scores for ChatGPT-4 and Bing Chat were 3.86 ± 0.41 and 3.70 ± 0.52, respectively. No statistically significant difference in performance was observed between the two LLMs (P = 0.101). The mean word count of the generated responses was 316.5 (± 85.1) for ChatGPT-4 and 61.6 (± 25.8) for Bing Chat (P < 0.05). According to FRE scores, the generated responses were suitable for only 4.5% of U.S. adults for ChatGPT-4 and 33% for Bing Chat (P < 0.05).

Conclusions: ChatGPT-4 and Bing Chat consistently provided appropriate responses to the questions. Both LLMs had low readability scores, but ChatGPT-4's responses were more difficult to read.
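The FRE and FKGL metrics used in the Methods are simple arithmetic over sentence, word, and syllable counts. A minimal sketch of the standard formulas follows; the vowel-group syllable counter here is a rough heuristic of my own for illustration, not the tool the authors used:

```python
import re

def count_syllables(word):
    """Rough heuristic: count runs of vowels, subtracting a silent trailing 'e'."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith(("le", "ee")) and count > 1:
        count -= 1
    return max(count, 1)

def _counts(text):
    """Return (sentences, words, syllables) for a plain-English text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return len(sentences), len(words), syllables

def flesch_reading_ease(text):
    """FRE = 206.835 - 1.015*(words/sentences) - 84.6*(syllables/words)."""
    n_sent, n_words, n_syll = _counts(text)
    return 206.835 - 1.015 * (n_words / n_sent) - 84.6 * (n_syll / n_words)

def flesch_kincaid_grade(text):
    """FKGL = 0.39*(words/sentences) + 11.8*(syllables/words) - 15.59."""
    n_sent, n_words, n_syll = _counts(text)
    return 0.39 * (n_words / n_sent) + 11.8 * (n_syll / n_words) - 15.59
```

Higher FRE means easier text (60-70 is plain English; below 30 is very difficult), while FKGL approximates the U.S. school grade needed to understand the passage, which is how a response can be "suitable" for only a small share of U.S. adults.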

