Ophthalmology, Xinhua Hospital Affiliated to Shanghai Jiaotong University School of Medicine, Shanghai, China.
Institute of Hospital Development Strategy, China Hospital Development Institute, Shanghai Jiao Tong University, Shanghai, China.
Br J Ophthalmol. 2024 Sep 20;108(10):1390-1397. doi: 10.1136/bjo-2023-324526.
Large language models (LLMs), such as ChatGPT, have considerable implications for various medical applications. However, ChatGPT's training primarily draws from English-centric internet data and is not tailored explicitly to the medical domain. Thus, an ophthalmic LLM in Chinese is clinically essential for both healthcare providers and patients in mainland China.
We developed an LLM of ophthalmology (MOPH) using Chinese corpora and evaluated its performance in three clinical scenarios: taking ophthalmic board exams in Chinese, answering evidence-based medicine-oriented ophthalmic questions, and diagnosing clinical vignettes. Additionally, we compared MOPH's performance with that of human doctors.
In the ophthalmic exam, MOPH's average score closely matched the mean score of trainees (64.7 (range 62-68) vs 66.2 (range 50-92), p=0.817), while MOPH scored above 60 in all seven mock exams. In answering ophthalmic questions, 83.3% (25/30) of MOPH's responses were rated as adhering to Chinese guidelines (Likert scale 4-5). Only 6.7% (2/30, Likert scale 1-2) and 10% (3/30, Likert scale 3) of responses were rated by reviewers as 'poor or very poor' and as containing 'potentially misinterpretable inaccuracies', respectively. In diagnostic accuracy, the rate of correct diagnosis was higher for ophthalmologists than for MOPH (96.1% vs 81.1%), but the difference was not statistically significant (p>0.05).
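The three Likert bands reported for the 30 answered questions partition the responses, so their percentages follow directly from the counts. The minimal Python sketch below reproduces that arithmetic from the counts given in the abstract; the band labels and the `band_percentages` helper are hypothetical names chosen for illustration, not part of the study's methods.

```python
# Counts per Likert band, taken from the abstract (30 rated responses in total).
# Band labels are illustrative names, not terminology defined by the study.
BANDS = {
    "guideline-adherent (4-5)": 25,
    "potentially misinterpretable (3)": 3,
    "poor or very poor (1-2)": 2,
}


def band_percentages(counts: dict[str, int]) -> dict[str, float]:
    """Return each band's share of all rated responses, in percent, rounded to 1 dp."""
    total = sum(counts.values())
    return {band: round(100 * n / total, 1) for band, n in counts.items()}


if __name__ == "__main__":
    for band, pct in band_percentages(BANDS).items():
        print(f"{band}: {pct}%")
```

Because the bands are mutually exclusive and cover all 30 responses, the three shares (83.3%, 10.0%, 6.7%) sum to 100%.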
This study demonstrated the promising performance of MOPH, a Chinese-specific ophthalmic LLM, in diverse clinical scenarios. MOPH has potential real-world applications in Chinese-language ophthalmology settings.