Performance of ChatGPT on optometry and vision science exam questions.

Authors

Yoshioka Nayuta, Honson Vanessa, Mani Revathy, Oberstein Sharon, Watt Kathleen, Maseedupally Vinod

Affiliation

School of Optometry and Vision Science, UNSW Australia, Sydney, New South Wales, Australia.

Publication details

Ophthalmic Physiol Opt. 2025 Sep;45(6):1376-1388. doi: 10.1111/opo.13544. Epub 2025 Jul 9.

DOI:10.1111/opo.13544

PMID:40631633

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12357226/

Abstract

The rapid proliferation of large language model (LLM) tools, such as ChatGPT developed by OpenAI, presents both a challenge and an opportunity for educators. While LLMs can generate convincing written responses across a wide range of academic fields, their capabilities vary noticeably across different models, fields and even sub-fields. This paper aims to evaluate the capabilities of LLMs in the field of optometry and vision science by analysing the quality of the responses generated by ChatGPT to sample long-answer questions covering different sub-fields of optometry, namely binocular vision, clinical communication, dispensing and ocular pathology. It also seeks to explore the possibility of LLMs being used as virtual graders. The capabilities of ChatGPT were explored using various GPT models (GPT-3.5, GPT-4 and o1, from oldest to newest), first by investigating the concordance between ChatGPT and a human grader. This was followed by benchmarking the performance of these GPT models on various sample questions in optometry and vision science. Statistical analyses included mixed-effects analysis, the Friedman test and the Wilcoxon signed-rank test, alongside thematic analysis. ChatGPT graders awarded higher marks than human graders, but the difference was significant only for GPT-3.5 (p < 0.05). Benchmarking on sample questions demonstrated that all GPT models can generate satisfactory responses above the 50% 'pass' score in many cases (p < 0.05), although performance varied significantly across sub-fields (p < 0.0001) and models (p = 0.0003). Newer models significantly outperformed older models in most cases. Differences in the frequency of thematic response errors between the GPT-3.5 and GPT-4 models were more mixed (p < 0.05 to p > 0.99), while o1 made no thematic errors. These findings indicate that ChatGPT may impact learning and teaching practices in this field. The inconsistent performance across sub-fields and additional implementation considerations, such as ethics and transparency, support a judicious adaptation of assessment practice and adoption of the technology in optometry and vision science education.
