Prompt engineering with ChatGPT3.5 and GPT4 to improve patient education on retinal diseases.

Author Information

Jung Hoyoung, Oh Jean, Stephenson Kirk A J, Joe Aaron W, Mammo Zaid N

Affiliations

Faculty of Medicine, University of British Columbia, Vancouver BC, Canada.

Department of Ophthalmology and Visual Sciences, University of British Columbia, Vancouver BC, Canada.

Publication Information

Can J Ophthalmol. 2025 Jun;60(3):e375-e381. doi: 10.1016/j.jcjo.2024.08.010. Epub 2024 Sep 5.

Abstract

OBJECTIVE

To assess the effect of prompt engineering on the accuracy, comprehensiveness, readability, and empathy of large language model (LLM)-generated responses to patient questions regarding retinal disease.

DESIGN

Prospective qualitative study.

PARTICIPANTS

Retina specialists, ChatGPT3.5, and GPT4.

METHODS

Twenty common patient questions regarding 5 retinal conditions were input into ChatGPT3.5 and GPT4 either as stand-alone questions, preceded by an optimized prompt (prompt A), or preceded by prompt A with specified limits on length and grade reading level (prompt B). Accuracy and comprehensiveness were graded by 3 retina specialists on a Likert scale from 1 to 5 (1: very poor; 5: very good). Readability of responses was assessed using Readable.com, an online readability tool.
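The three query conditions and the readability metric can be sketched as follows. The prompt wordings below are illustrative placeholders (the study's actual optimized prompts are not given in the abstract), and the grade-level function uses the standard Flesch-Kincaid formula with a naive syllable heuristic rather than Readable.com.

```python
import re

# Placeholder prompts -- the study's exact prompt A and prompt B
# texts are not reported in the abstract.
PROMPT_A = ("You are an ophthalmologist explaining retinal disease "
            "to a patient in clear, empathetic language.")
PROMPT_B = (PROMPT_A + " Limit your answer to 150 words, written "
            "at an 8th-grade reading level.")

def build_query(question: str, mode: str) -> str:
    """Assemble the LLM input for one of the three study conditions."""
    if mode == "standalone":
        return question
    if mode == "prompt_a":
        return f"{PROMPT_A}\n\n{question}"
    if mode == "prompt_b":
        return f"{PROMPT_B}\n\n{question}"
    raise ValueError(f"unknown mode: {mode}")

def count_syllables(word: str) -> int:
    """Approximate syllables by counting vowel groups (heuristic only)."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def fk_grade(text: str) -> float:
    """Flesch-Kincaid grade level:
    0.39*(words/sentences) + 11.8*(syllables/words) - 15.59
    """
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * len(words) / sentences \
        + 11.8 * syllables / len(words) - 15.59

q = "What is macular degeneration?"
print(build_query(q, "prompt_b"))
print(round(fk_grade("The retina senses light. "
                     "It sends signals to the brain."), 1))
```

Short, plainly worded answers score lower (easier) on this scale, which is the direction of change the study reports for prompt B responses.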

RESULTS

There were no significant differences between ChatGPT3.5 and GPT4 across any of the metrics tested. Median accuracy of responses to stand-alone, prompt A, and prompt B questions was 5.0, 5.0, and 4.0, respectively. Median comprehensiveness of responses to stand-alone, prompt A, and prompt B questions was 5.0, 5.0, and 4.0, respectively. The use of prompt B was associated with lower accuracy and comprehensiveness than responses to stand-alone or prompt A questions (p < 0.001). The average grade reading level of responses across both LLMs was 13.45, 11.5, and 10.3 for stand-alone, prompt A, and prompt B questions, respectively (p < 0.001).

CONCLUSIONS

Prompt engineering can significantly improve readability of LLM-generated responses, although at the cost of reducing accuracy and comprehensiveness. Further study is needed to understand the utility and bioethical implications of LLMs as a patient educational resource.

