Kayabaşı Mustafa, Köksaldı Seher, Durmaz Engin Ceren
Department of Ophthalmology, Mus State Hospital, Mus, Turkey.
Department of Ophthalmology, Izmir Democracy University Buca Seyfi Demirsoy Education and Research Hospital, Izmir, Turkey.
Clin Exp Optom. 2024 Oct 24:1-8. doi: 10.1080/08164622.2024.2419524.
Artificial intelligence has undergone a rapid evolution, and large language models (LLMs) have become promising tools for healthcare, with the ability to provide human-like responses to questions. The capabilities of these tools in addressing questions related to keratoconus (KCN) have not been previously explored.
In this study, the responses of three LLMs - ChatGPT-4, Copilot, and Gemini - to common patient questions regarding KCN were evaluated.
Fifty real-life patient inquiries regarding general information, aetiology, symptoms and diagnosis, progression, and treatment of KCN were presented to the LLMs. Three ophthalmologists evaluated the answers using a 5-point Likert scale ranging from 'strongly disagree' to 'strongly agree'. The reliability of the responses provided by the LLMs was evaluated using the DISCERN and the Ensuring Quality Information for Patients (EQIP) scales. Readability metrics (Flesch Reading Ease Score, Flesch-Kincaid Grade Level, and Coleman-Liau Index) were calculated to evaluate the complexity of the responses.
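For reference, the abstract does not restate these readability indices; their widely used standard formulations are given below, where W, St, and Sy denote word, sentence, and syllable counts, and L and S are the average numbers of letters and sentences per 100 words.

\[
\text{FRES} = 206.835 - 1.015\,\frac{W}{S_t} - 84.6\,\frac{Sy}{W}
\]
\[
\text{FKGL} = 0.39\,\frac{W}{S_t} + 11.8\,\frac{Sy}{W} - 15.59
\]
\[
\text{CLI} = 0.0588\,L - 0.296\,S - 15.8
\]

Lower FRES values, and higher FKGL and CLI values, indicate text that is harder to read.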
ChatGPT-4 consistently scored 3 points or higher for all (100%) of its responses, while Copilot had five (10%) and Gemini had two (4%) responses scoring 2 points or below. ChatGPT-4 achieved a 'strongly agree' rate of 74% across all questions, markedly superior to Copilot at 34% and Gemini at 42% (P < 0.001), and recorded the highest 'strongly agree' rates in the general information and symptoms & diagnosis categories (90% for both). The median Likert scores differed among the LLMs (P < 0.001), with ChatGPT-4 scoring highest and Copilot scoring lowest. Although ChatGPT-4 exhibited greater reliability based on the DISCERN scale, it was characterised by lower readability and higher complexity. While all LLMs provided responses categorised as 'extremely difficult to read', the responses provided by Copilot showed higher readability.
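The abstract does not name the statistical test behind the P-values. A minimal sketch of how per-question Likert scores from three models could be compared with a nonparametric test is shown below; the choice of Kruskal-Wallis and the example scores are assumptions for illustration only, not the authors' reported method.

```python
# Illustrative sketch only: the test choice and the scores below are assumptions,
# not the data or analysis reported in the study.
from scipy.stats import kruskal

# Hypothetical per-question Likert scores (1-5) for each model
chatgpt4 = [5, 5, 4, 5, 3, 5, 4, 5, 5, 4]
copilot  = [3, 4, 2, 4, 3, 3, 4, 2, 3, 4]
gemini   = [4, 4, 3, 5, 3, 4, 4, 3, 4, 4]

# Kruskal-Wallis H-test compares the score distributions of the three groups
stat, p = kruskal(chatgpt4, copilot, gemini)
print(f"Kruskal-Wallis H = {stat:.2f}, p = {p:.4f}")
```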
Although its responses exhibited lower readability and greater complexity, ChatGPT-4 emerged as the most proficient of the three LLMs in answering KCN-related questions.