

Evaluation of Generative Language Models in Personalizing Medical Information: Instrument Validation Study.

Authors

Spina Aidin, Andalib Saman, Flores Daniel, Vermani Rishi, Halaseh Faris F, Nelson Ariana M

Affiliations

School of Medicine, University of California, Irvine, Irvine, CA, United States.

Department of Anesthesiology and Perioperative Care, University of California, Irvine, Irvine, CA, United States.

Publication

JMIR AI. 2024 Aug 13;3:e54371. doi: 10.2196/54371.

DOI:10.2196/54371
PMID:39137416
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11350306/
Abstract

BACKGROUND

Although uncertainties exist regarding implementation, artificial intelligence-driven generative language models (GLMs) have enormous potential in medicine. Deployment of GLMs could improve patient comprehension of clinical texts and improve low health literacy.

OBJECTIVE

The goal of this study is to evaluate the potential of ChatGPT-3.5 and GPT-4 to tailor the complexity of medical information to patient-specific input education level, which is crucial if it is to serve as a tool in addressing low health literacy.

METHODS

Input templates related to 2 prevalent chronic diseases, type II diabetes and hypertension, were designed. Each clinical vignette was adjusted for hypothetical patient education levels to evaluate output personalization. To assess how successfully each GLM (GPT-3.5 and GPT-4) tailored its output writing, the readability of pre- and posttransformation outputs was quantified using the Flesch reading ease score (FKRE) and the Flesch-Kincaid grade level (FKGL).
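The two readability metrics used in the methods are simple functions of average sentence length and average syllables per word. A minimal sketch of both formulas in Python (the vowel-group syllable counter is a crude assumption for illustration, not the tooling the authors used):

```python
import re

def count_syllables(word: str) -> int:
    # Crude heuristic: one syllable per run of consecutive vowels (min 1).
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def readability(text: str) -> tuple[float, float]:
    """Return (Flesch reading ease, Flesch-Kincaid grade level)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    wps = len(words) / len(sentences)   # average words per sentence
    spw = syllables / len(words)        # average syllables per word
    fre = 206.835 - 1.015 * wps - 84.6 * spw
    fkgl = 0.39 * wps + 11.8 * spw - 15.59
    return round(fre, 2), round(fkgl, 2)
```

Higher reading ease and lower grade level both indicate simpler text, so a successful simplification should raise the first score and lower the second.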

RESULTS

Responses (n=80) were generated using GPT-3.5 and GPT-4 across 2 clinical vignettes. For GPT-3.5, FKRE means were 57.75 (SD 4.75), 51.28 (SD 5.14), 32.28 (SD 4.52), and 28.31 (SD 5.22) for 6th grade, 8th grade, high school, and bachelor's, respectively; FKGL means were 9.08 (SD 0.90), 10.27 (SD 1.06), 13.4 (SD 0.80), and 13.74 (SD 1.18). GPT-3.5 aligned with the prespecified education level only at the bachelor's degree. Conversely, GPT-4's FKRE means were 74.54 (SD 2.6), 71.25 (SD 4.96), 47.61 (SD 6.13), and 13.71 (SD 5.77), with FKGL means of 6.3 (SD 0.73), 6.7 (SD 1.11), 11.09 (SD 1.26), and 17.03 (SD 1.11) for the same respective education levels. GPT-4 met the target readability for all groups except the 6th-grade FKRE average. Both GLMs produced statistically significant differences in mean FKRE and FKGL across input education levels (FKRE: 6th grade P<.001; 8th grade P<.001; high school P<.001; bachelor's P=.003; FKGL: 6th grade P=.001; 8th grade P<.001; high school P<.001; bachelor's P<.001).

CONCLUSIONS

GLMs can change the structure and readability of medical text outputs according to input-specified education. However, GLMs categorize input education designation into 3 broad tiers of output readability: easy (6th and 8th grade), medium (high school), and difficult (bachelor's degree). This is the first result to suggest that there are broader boundaries in the success of GLMs in output text simplification. Future research must establish how GLMs can reliably personalize medical texts to prespecified education levels to enable a broader impact on health care literacy.
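The three-tier pattern described in the conclusions can be made concrete as a small classifier. The numeric cutoffs below are illustrative assumptions chosen to separate the GPT-4 FKGL means reported in the results (roughly 6.3-6.7, 11.09, and 17.03); the paper itself does not define thresholds:

```python
def tier(fkgl: float) -> str:
    # Illustrative cutoffs (not from the paper) separating the three
    # observed output-readability tiers.
    if fkgl < 9.0:
        return "easy"       # 6th- and 8th-grade inputs clustered here
    if fkgl < 14.0:
        return "medium"     # high-school inputs
    return "difficult"      # bachelor's-degree inputs
```

Applied to the reported GPT-4 means, the four input education levels collapse into exactly these three output tiers, which is the paper's central finding.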


Figures:
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0e94/11350306/54800eb935ab/ai_v3i1e54371_fig1.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0e94/11350306/5995afff1801/ai_v3i1e54371_fig2.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0e94/11350306/0d58b892a057/ai_v3i1e54371_fig3.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0e94/11350306/771d4e5e80e4/ai_v3i1e54371_fig4.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0e94/11350306/8475f88dfe9e/ai_v3i1e54371_fig5.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0e94/11350306/2df6cbdfd1d7/ai_v3i1e54371_fig6.jpg

Similar Articles

1. Evaluation of Generative Language Models in Personalizing Medical Information: Instrument Validation Study.
JMIR AI. 2024 Aug 13;3:e54371. doi: 10.2196/54371.
2. Assessing the Application of Large Language Models in Generating Dermatologic Patient Education Materials According to Reading Level: Qualitative Study.
JMIR Dermatol. 2024 May 16;7:e55898. doi: 10.2196/55898.
3. Evaluating the Efficacy of ChatGPT as a Patient Education Tool in Prostate Cancer: Multimetric Assessment.
J Med Internet Res. 2024 Aug 14;26:e55939. doi: 10.2196/55939.
4. Can Artificial Intelligence Improve the Readability of Patient Education Materials on Aortic Stenosis? A Pilot Study.
Cardiol Ther. 2024 Mar;13(1):137-147. doi: 10.1007/s40119-023-00347-0. Epub 2024 Jan 9.
5. AI-Generated Information for Vascular Patients: Assessing the Standard of Procedure-Specific Information Provided by the ChatGPT AI-Language Model.
Cureus. 2023 Nov 30;15(11):e49764. doi: 10.7759/cureus.49764. eCollection 2023 Nov.
6. Dr. Google to Dr. ChatGPT: assessing the content and quality of artificial intelligence-generated medical information on appendicitis.
Surg Endosc. 2024 May;38(5):2887-2893. doi: 10.1007/s00464-024-10739-5. Epub 2024 Mar 5.
7. The Use of Large Language Models to Generate Education Materials about Uveitis.
Ophthalmol Retina. 2024 Feb;8(2):195-201. doi: 10.1016/j.oret.2023.09.008. Epub 2023 Sep 15.
8. Accuracy and Readability of Artificial Intelligence Chatbot Responses to Vasectomy-Related Questions: Public Beware.
Cureus. 2024 Aug 28;16(8):e67996. doi: 10.7759/cureus.67996. eCollection 2024 Aug.
9. Generative Pre-trained Transformer 4 makes cardiovascular magnetic resonance reports easy to understand.
J Cardiovasc Magn Reson. 2024 Summer;26(1):101035. doi: 10.1016/j.jocmr.2024.101035. Epub 2024 Mar 7.
10. Accuracy, readability, and understandability of large language models for prostate cancer information to the public.
Prostate Cancer Prostatic Dis. 2024 May 14. doi: 10.1038/s41391-024-00826-y.

Cited By

1. Enhancing Magnetic Resonance Imaging (MRI) Report Comprehension in Spinal Trauma: Readability Analysis of AI-Generated Explanations for Thoracolumbar Fractures.
JMIR AI. 2025 Jul 1;4:e69654. doi: 10.2196/69654.
2. Using AI to Translate and Simplify Spanish Orthopedic Medical Text: Instrument Validation Study.
JMIR AI. 2025 Mar 21;4:e70222. doi: 10.2196/70222.
3. Tailoring glaucoma education using large language models: Addressing health disparities in patient comprehension.
Medicine (Baltimore). 2025 Jan 10;104(2):e41059. doi: 10.1097/MD.0000000000041059.
4. Source Characteristics Influence AI-Enabled Orthopaedic Text Simplification: Recommendations for the Future.
JB JS Open Access. 2025 Jan 8;10(1). doi: 10.2106/JBJS.OA.24.00007. eCollection 2025 Jan-Mar.

References

1. Does ChatGPT Answer Otolaryngology Questions Accurately?
Laryngoscope. 2024 Sep;134(9):4011-4015. doi: 10.1002/lary.31410. Epub 2024 Mar 28.
2. Reliability and accuracy of artificial intelligence ChatGPT in providing information on ophthalmic diseases and management to patients.
Eye (Lond). 2024 May;38(7):1368-1373. doi: 10.1038/s41433-023-02906-0. Epub 2024 Jan 20.
3. Optimizing Ophthalmology Patient Education via ChatBot-Generated Materials: Readability Analysis of AI-Generated Patient Education Materials and The American Society of Ophthalmic Plastic and Reconstructive Surgery Patient Brochures.
Ophthalmic Plast Reconstr Surg. 2024;40(2):212-216. doi: 10.1097/IOP.0000000000002549. Epub 2023 Nov 16.
4. ChatGPT Interactive Medical Simulations for Early Clinical Education: Case Study.
JMIR Med Educ. 2023 Nov 10;9:e49877. doi: 10.2196/49877.
5. Enhancing Patient Communication With Chat-GPT in Radiology: Evaluating the Efficacy and Readability of Answers to Common Imaging-Related Questions.
J Am Coll Radiol. 2024 Feb;21(2):353-359. doi: 10.1016/j.jacr.2023.09.011. Epub 2023 Oct 18.
6. Evaluation of Artificial Intelligence-generated Responses to Common Plastic Surgery Questions.
Plast Reconstr Surg Glob Open. 2023 Aug 30;11(8):e5226. doi: 10.1097/GOX.0000000000005226. eCollection 2023 Aug.
7. Readability of spine-related patient education materials: a standard method for improvement.
Eur Spine J. 2023 Sep;32(9):3039-3046. doi: 10.1007/s00586-023-07856-5. Epub 2023 Jul 19.
8. Evaluating the Effectiveness of Artificial Intelligence-powered Large Language Models Application in Disseminating Appropriate and Readable Health Information in Urology.
J Urol. 2023 Oct;210(4):688-694. doi: 10.1097/JU.0000000000003615. Epub 2023 Jul 10.
9. Bridging the Gap Between Urological Research and Patient Understanding: The Role of Large Language Models in Automated Generation of Layperson's Summaries.
Urol Pract. 2023 Sep;10(5):436-443. doi: 10.1097/UPJ.0000000000000428. Epub 2023 Jul 5.
10. Can ChatGPT, an Artificial Intelligence Language Model, Provide Accurate and High-quality Patient Information on Prostate Cancer?
Urology. 2023 Oct;180:35-58. doi: 10.1016/j.urology.2023.05.040. Epub 2023 Jul 4.