Assessing the Responses of Large Language Models (ChatGPT-4, Claude 3, Gemini, and Microsoft Copilot) to Frequently Asked Questions in Retinopathy of Prematurity: A Study on Readability and Appropriateness.

Author information

Ermis Serhat, Özal Ece, Karapapak Murat, Kumantaş Ebrar, Özal Sadık Altan

Publication information

J Pediatr Ophthalmol Strabismus. 2025 Mar-Apr;62(2):84-95. doi: 10.3928/01913913-20240911-05. Epub 2024 Oct 28.

DOI:10.3928/01913913-20240911-05

PMID:39465590

Abstract

PURPOSE

To assess the appropriateness and readability of responses provided by four large language models (LLMs) (ChatGPT-4, Claude 3, Gemini, and Microsoft Copilot) to parents' queries pertaining to retinopathy of prematurity (ROP).

METHODS

A total of 60 frequently asked questions were collated and categorized into six distinct sections. The responses generated by the LLMs were evaluated by three experienced ROP specialists to determine their appropriateness and comprehensiveness. Additionally, the readability of the responses was assessed using a range of metrics, including the Flesch-Kincaid Grade Level (FKGL), Gunning Fog (GF) Index, Coleman-Liau (CL) Index, Simple Measure of Gobbledygook (SMOG) Index, and Flesch Reading Ease (FRE) score.
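For readers unfamiliar with the five readability metrics named above, the following is a minimal sketch of their standard published formulas. It is not the tool used in the study (the abstract does not name one). The function names `count_syllables`, `text_counts`, and `readability` are hypothetical, the syllable counter is a rough vowel-group heuristic, and "complex words" in the Gunning Fog Index are approximated as words of three or more syllables.

```python
import math
import re


def count_syllables(word: str) -> int:
    """Rough heuristic (an assumption): one syllable per vowel group, minimum 1."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def text_counts(text: str) -> dict:
    """Collect the raw counts that the readability formulas need."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    syllables = [count_syllables(w) for w in words]
    return {
        "sentences": len(sentences),
        "words": len(words),
        "letters": sum(len(w) for w in words),
        "syllables": sum(syllables),
        # Words with 3+ syllables; also used here as "complex words" for GF.
        "polysyllables": sum(1 for n in syllables if n >= 3),
    }


def readability(c: dict) -> dict:
    """Apply the standard formulas to a dict of counts from text_counts()."""
    wps = c["words"] / c["sentences"]   # average words per sentence
    spw = c["syllables"] / c["words"]   # average syllables per word
    letters_per_100 = c["letters"] / c["words"] * 100
    sentences_per_100 = c["sentences"] / c["words"] * 100
    return {
        # Higher FRE means easier text; the other four estimate a grade level.
        "FRE": 206.835 - 1.015 * wps - 84.6 * spw,
        "FKGL": 0.39 * wps + 11.8 * spw - 15.59,
        "GF": 0.4 * (wps + 100 * c["polysyllables"] / c["words"]),
        "CL": 0.0588 * letters_per_100 - 0.296 * sentences_per_100 - 15.8,
        "SMOG": 1.0430 * math.sqrt(c["polysyllables"] * 30 / c["sentences"]) + 3.1291,
    }
```

Calling `readability(text_counts(response_text))` on a model's answer would give all five scores at once. Dedicated readability tools use more careful sentence and syllable detection, so their numbers can differ from this sketch.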

RESULTS

ChatGPT-4 demonstrated the highest level of appropriateness (100%) and performed exceptionally well in the Likert analysis, scoring 5 points on 96% of questions. The CL Index and FRE scores identified Gemini as the most readable LLM, whereas the GF Index and SMOG Index rated Microsoft Copilot as the most readable. Nevertheless, ChatGPT-4 exhibited the most intricate text structure, with scores of 18.56 on the GF Index, 18.56 on the CL Index, 17.2 on the SMOG Index, and 9.45 on the FRE score. These scores suggest that its responses demand college-level comprehension.

CONCLUSIONS

ChatGPT-4 demonstrated higher performance than the other LLMs in responding to questions related to ROP; however, its texts were more complex. In terms of readability, Gemini and Microsoft Copilot were found to be more successful.
