Assessing the Quality and Reliability of ChatGPT's Responses to Radiotherapy-Related Patient Queries: Comparative Study With GPT-3.5 and GPT-4.

Author Information

Grilo Ana, Marques Catarina, Corte-Real Maria, Carolino Elisabete, Caetano Marco

Affiliations

Research Center for Psychological Science of the Faculty of Psychology, University of Lisbon (CICPSI), Faculdade de Psicologia, Universidade de Lisboa, Av. D. João II, Lote 4.69.01, Parque das Nações, Lisboa, 1990-096, Portugal. Phone: +351 964371101.

Escola Superior de Tecnologia da Saúde, Instituto Politécnico de Lisboa, Lisboa, Portugal.

Publication Information

JMIR Cancer. 2025 Apr 16;11:e63677. doi: 10.2196/63677.

Abstract

BACKGROUND

Patients frequently resort to the internet to access information about cancer. However, these websites often lack content accuracy and readability. Recently, ChatGPT, an artificial intelligence-powered chatbot, has signified a potential paradigm shift in how patients with cancer can access vast amounts of medical information, including insights into radiotherapy. However, the quality of the information provided by ChatGPT remains unclear. This is particularly significant given the general public's limited knowledge of this treatment and concerns about its possible side effects. Furthermore, evaluating the quality of responses is crucial, as misinformation can foster a false sense of knowledge and security, lead to noncompliance, and result in delays in receiving appropriate treatment.

OBJECTIVE

This study aims to evaluate the quality and reliability of ChatGPT's responses to common patient queries about radiotherapy, comparing the performance of ChatGPT's two versions: GPT-3.5 and GPT-4.

METHODS

We selected 40 commonly asked radiotherapy questions and entered the queries in both versions of ChatGPT. Response quality and reliability were evaluated by 16 radiotherapy experts using the General Quality Score (GQS), a 5-point Likert scale, with the median GQS determined based on the experts' ratings. Consistency and similarity of responses were assessed using the cosine similarity score, which ranges from 0 (complete dissimilarity) to 1 (complete similarity). Readability was analyzed using the Flesch Reading Ease Score, ranging from 0 to 100, and the Flesch-Kincaid Grade Level, reflecting the average number of years of education required for comprehension. Statistical analyses were performed using the Mann-Whitney test and effect size, with results deemed significant at a 5% level (P=.05). To assess agreement between experts, Krippendorff α and Fleiss κ were used.
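The abstract does not name the software used for these metrics; the following Python sketch illustrates one plausible way to compute the cosine similarity and readability scores described above, assuming TF-IDF vectors for the similarity calculation and the standard Flesch formulas as implemented in the textstat package. The two response strings are placeholders, not study data.

```python
# Illustrative sketch of the similarity and readability metrics from the Methods.
# Assumption: TF-IDF vectors for cosine similarity; the abstract does not state
# the authors' exact text representation, so this is not the study's pipeline.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import textstat

# Hypothetical responses to the same patient query from the two model versions.
resp_gpt35 = "Radiotherapy uses high-energy radiation to damage cancer cells..."
resp_gpt4 = "Radiotherapy is a treatment that uses ionizing radiation to destroy cancer cells..."

# Cosine similarity between the two responses:
# 0 = complete dissimilarity, 1 = complete similarity.
tfidf = TfidfVectorizer().fit_transform([resp_gpt35, resp_gpt4])
similarity = cosine_similarity(tfidf[0], tfidf[1])[0, 0]

# Readability: Flesch Reading Ease (0-100, higher = easier to read) and
# Flesch-Kincaid Grade Level (approximate years of education required).
fre_35 = textstat.flesch_reading_ease(resp_gpt35)
fkgl_35 = textstat.flesch_kincaid_grade(resp_gpt35)
fre_4 = textstat.flesch_reading_ease(resp_gpt4)
fkgl_4 = textstat.flesch_kincaid_grade(resp_gpt4)

print(f"Cosine similarity: {similarity:.2f}")
print(f"GPT-3.5: FRE={fre_35:.1f}, FKGL={fkgl_35:.1f}")
print(f"GPT-4:   FRE={fre_4:.1f}, FKGL={fkgl_4:.1f}")
```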

RESULTS

GPT-4 demonstrated superior performance, with a higher GQS and a lower number of scores of 1 and 2, compared to GPT-3.5. The Mann-Whitney test revealed statistically significant differences in some questions, with GPT-4 generally receiving higher ratings. The median (IQR) cosine similarity score indicated substantial similarity (0.81, IQR 0.05) and consistency in the responses of both versions (GPT-3.5: 0.85, IQR 0.04; GPT-4: 0.83, IQR 0.04). Readability scores for both versions were considered college level, with GPT-4 scoring slightly better in the Flesch Reading Ease Score (34.61) and Flesch-Kincaid Grade Level (12.32) compared to GPT-3.5 (32.98 and 13.32, respectively). Responses by both versions were deemed challenging for the general public.
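The abstract reports only summary outcomes of the per-question comparison; the sketch below shows how a Mann-Whitney test with a rank-biserial effect size could be applied to the 16 experts' GQS ratings for a single question, following the Methods. The rating vectors are hypothetical and are not the study's data.

```python
# Illustrative per-question comparison of expert GQS ratings (5-point Likert)
# between GPT-3.5 and GPT-4. The ratings below are hypothetical.
from scipy.stats import mannwhitneyu

gqs_gpt35 = [3, 3, 4, 2, 3, 4, 3, 3, 2, 4, 3, 3, 4, 3, 2, 3]  # 16 expert ratings
gqs_gpt4  = [4, 4, 5, 4, 4, 5, 4, 3, 4, 5, 4, 4, 5, 4, 4, 4]

u_stat, p_value = mannwhitneyu(gqs_gpt35, gqs_gpt4, alternative="two-sided")

# Rank-biserial correlation as an effect size: r = 1 - 2U / (n1 * n2).
n1, n2 = len(gqs_gpt35), len(gqs_gpt4)
rank_biserial = 1 - (2 * u_stat) / (n1 * n2)

# Differences are judged significant at the 5% level, as stated in the Methods.
print(f"U={u_stat:.1f}, P={p_value:.4f}, rank-biserial r={rank_biserial:.2f}")
```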

CONCLUSIONS

Both GPT-3.5 and GPT-4 demonstrated the capability to address radiotherapy concepts, with GPT-4 showing superior performance. However, both models present readability challenges for the general population. Although ChatGPT demonstrates potential as a valuable resource for addressing common patient queries related to radiotherapy, it is imperative to acknowledge its limitations, including the risks of misinformation and readability issues. In addition, its implementation should be supported by strategies to enhance accessibility and readability.

Figure: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e565/12017613/3fc23efbed85/cancer-v11-e63677-g001.jpg
