Artificial intelligence as a modality to enhance the readability of neurosurgical literature for patients.

Author Information

Guerra Gage A, Grove Sophie, Le Jonathan, Hofmann Hayden L, Shah Ishan, Bhagavatula Sweta, Fixman Benjamin, Gomez David, Hopkins Benjamin, Dallas Jonathan, Cacciamani Giovanni, Peterson Racheal, Zada Gabriel

Affiliation Information

Departments of 1Neurosurgery and 2Urology, University of Southern California, Los Angeles, California.

Publication Information

J Neurosurg. 2024 Nov 8;142(4):1189-1195. doi: 10.3171/2024.6.JNS24617. Print 2025 Apr 1.

Abstract

OBJECTIVE

In this study the authors assessed the ability of Chat Generative Pretrained Transformer (ChatGPT) 3.5 and ChatGPT4 to generate readable and accurate summaries of published neurosurgical literature.

METHODS

Abstracts published in journal issues released between June 2023 and August 2023 (n = 150) were randomly selected from the top 5 ranked neurosurgical journals according to Google Scholar. ChatGPT models were instructed, via a statistically validated prompt, to generate a readable layperson summary of each original abstract. Readability results and grade-level indicators (RR-GLIs) were calculated for the GPT3.5- and GPT4-generated summaries and the original abstracts. Two physicians independently rated the accuracy of the ChatGPT-generated layperson summaries to assess scientific validity. A one-way ANOVA followed by pairwise t-tests with Bonferroni correction was performed to compare readability scores, and Cohen's kappa was used to assess interrater agreement between the two rating physicians.
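As a rough illustration of this pipeline (a sketch, not the authors' code), the six readability indices and the statistical comparisons described above can be computed with standard Python tooling. The textstat, scipy, and scikit-learn packages and all variable names below are assumptions; the summaries themselves would come from the ChatGPT API using the authors' validated prompt, which is not reproduced in the abstract.

```python
# Illustrative sketch only, not the authors' pipeline. Assumes the
# third-party packages textstat, scipy, and scikit-learn.
import textstat
from scipy.stats import f_oneway, ttest_rel
from sklearn.metrics import cohen_kappa_score

# The six RR-GLIs reported in the abstract, mapped to textstat functions.
METRICS = {
    "Flesch-Kincaid grade": textstat.flesch_kincaid_grade,
    "Gunning fog": textstat.gunning_fog,
    "SMOG index": textstat.smog_index,
    "Coleman-Liau index": textstat.coleman_liau_index,
    "Automated readability index": textstat.automated_readability_index,
    "Flesch reading ease": textstat.flesch_reading_ease,
}

def compare_readability(originals, gpt35, gpt4):
    """One-way ANOVA across the three text groups per metric, then
    Bonferroni-corrected pairwise t-tests (paired here, since each
    summary derives from one original abstract)."""
    for name, score in METRICS.items():
        groups = {
            "original": [score(t) for t in originals],
            "GPT3.5": [score(t) for t in gpt35],
            "GPT4": [score(t) for t in gpt4],
        }
        _, p_anova = f_oneway(*groups.values())
        print(f"{name}: ANOVA p = {p_anova:.2g}")
        pairs = [("original", "GPT3.5"), ("original", "GPT4"),
                 ("GPT3.5", "GPT4")]
        for a, b in pairs:
            _, p = ttest_rel(groups[a], groups[b])
            p_adj = min(1.0, p * len(pairs))  # Bonferroni correction
            print(f"  {a} vs {b}: adjusted p = {p_adj:.2g}")

# Interrater agreement between the two physicians' accuracy ratings:
# kappa = cohen_kappa_score(rater1_labels, rater2_labels)
```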

RESULTS

Analysis of the 150 original abstracts showed a statistically significant difference for all RR-GLIs between the ChatGPT-generated summaries and the original abstracts. The readability scores are reported as (original abstract mean, GPT3.5 summary mean, GPT4 summary mean, p value): Flesch-Kincaid reading grade (12.55, 7.80, 7.70, p < 0.0001); Gunning fog score (15.46, 10.00, 9.00, p < 0.0001); Simple Measure of Gobbledygook (SMOG) index (11.30, 7.13, 6.60, p < 0.0001); Coleman-Liau index (14.67, 11.32, 10.26, p < 0.0001); automated readability index (10.87, 8.50, 7.75, p < 0.0001); and Flesch-Kincaid reading ease (33.29, 68.45, 69.55, p < 0.0001). GPT4-generated summaries demonstrated better readability (lower grade-level scores) than GPT3.5-generated summaries in the following categories: Gunning fog score (p = 0.0003); SMOG index (p = 0.027); Coleman-Liau index (p < 0.0001); sentences (p < 0.0001); complex words (p < 0.0001); and % complex words (p = 0.0035). A total of 68.4% and 84.2% of the GPT3.5- and GPT4-generated summaries, respectively, maintained moderate scientific accuracy according to the two physician reviewers.
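For context on these scores, the indices above are standard functions of surface text statistics; the definitions below are the commonly published formulas, not equations quoted from this article. Lower values indicate easier text for every index except reading ease, where higher is easier, which is why that score rises while the others fall for the ChatGPT summaries.

```latex
% Standard published definitions (not taken from this article).
% "complex words" / "polysyllables" = words of three or more syllables;
% L = average letters per 100 words, S = average sentences per 100 words.
\begin{align*}
\text{Flesch--Kincaid grade} &= 0.39\,\frac{\text{words}}{\text{sentences}}
  + 11.8\,\frac{\text{syllables}}{\text{words}} - 15.59 \\
\text{Flesch reading ease} &= 206.835 - 1.015\,\frac{\text{words}}{\text{sentences}}
  - 84.6\,\frac{\text{syllables}}{\text{words}} \\
\text{Gunning fog} &= 0.4\left(\frac{\text{words}}{\text{sentences}}
  + 100\,\frac{\text{complex words}}{\text{words}}\right) \\
\text{SMOG} &= 1.043\sqrt{30\,\frac{\text{polysyllables}}{\text{sentences}}} + 3.1291 \\
\text{Coleman--Liau} &= 0.0588\,L - 0.296\,S - 15.8 \\
\text{ARI} &= 4.71\,\frac{\text{characters}}{\text{words}}
  + 0.5\,\frac{\text{words}}{\text{sentences}} - 21.43
\end{align*}
```

As a sanity check against the reported means, a Flesch-Kincaid grade falling from 12.55 to 7.70 corresponds to moving from college-entry reading level to roughly an 8th-grade level.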

CONCLUSIONS

These findings demonstrate promising potential for the application of ChatGPT in patient education. GPT4 is an accessible tool that offers an immediate means of enhancing the readability of current neurosurgical literature. Layperson summaries generated by GPT4 would be a valuable addition to neurosurgical journals and would likely improve comprehension for patients using internet resources such as PubMed.
