Evaluating the effectiveness of large language models in patient education for conjunctivitis.

Author information

Wang Jingyuan, Shi Runhan, Le Qihua, Shan Kun, Chen Zhi, Zhou Xujiao, He Yao, Hong Jiaxu

Affiliations

Department of Ophthalmology and Vision Science, State Key Laboratory of Molecular Engineering of Polymers, Fudan University, Shanghai, People's Republic of China.

Macao Translational Medicine Center, Macau University of Science and Technology, Taipa, Macau SAR, Macau, People's Republic of China.

Publication information

Br J Ophthalmol. 2025 Jan 28;109(2):185-191. doi: 10.1136/bjo-2024-325599.

DOI:10.1136/bjo-2024-325599

PMID:39214677

Abstract

AIMS

To evaluate the quality of responses from large language models (LLMs) to patient-generated conjunctivitis questions.

METHODS

A two-phase, cross-sectional study was conducted at the Eye and ENT Hospital of Fudan University. In phase 1, four LLMs (GPT-4, Qwen, Baichuan 2 and PaLM 2) responded to 22 frequently asked conjunctivitis questions. Six expert ophthalmologists assessed these responses using a 5-point Likert scale for correctness, completeness, readability, helpfulness and safety, supplemented by objective readability analysis. In phase 2, 30 patients with conjunctivitis interacted with GPT-4 or Qwen and rated the LLM-generated responses on satisfaction, humanisation, professionalism and the phase 1 dimensions other than correctness. Three ophthalmologists assessed these responses using the phase 1 criteria, allowing a comparison between clinician and patient evaluations as a test of the study's practical significance.

RESULTS

In phase 1, GPT-4 excelled across all metrics, particularly in correctness (4.39±0.76), completeness (4.31±0.96) and readability (4.65±0.59) while Qwen showed similarly strong performance in helpfulness (4.37±0.93) and safety (4.25±1.03). Baichuan 2 and PaLM 2 were effective but trailed behind GPT-4 and Qwen. The objective readability analysis revealed GPT-4's responses as the most detailed, with PaLM 2's being the most succinct. Phase 2 demonstrated GPT-4 and Qwen's robust performance, with high satisfaction levels and consistent evaluations from both patients and professionals.

CONCLUSIONS

Our study showed LLMs effectively improve patient education in conjunctivitis. These models showed considerable promise in real-world patient interactions. Despite encouraging results, further refinement, particularly in personalisation and handling complex inquiries, is essential prior to the clinical integration of these LLMs.
