Assessing the Quality and Reliability of ChatGPT's Responses to Radiotherapy-Related Patient Queries: Comparative Study With GPT-3.5 and GPT-4.

Author Information

Grilo Ana, Marques Catarina, Corte-Real Maria, Carolino Elisabete, Caetano Marco

Affiliations

Research Center for Psychological Science of the Faculty of Psychology, University of Lisbon (CICPSI), Faculdade de Psicologia, Universidade de Lisboa, Av. D. João II, Lote 4.69.01, Parque das Nações, Lisboa, 1990-096, Portugal. Phone: +351 964371101.

Escola Superior de Tecnologia da Saúde, Instituto Politécnico de Lisboa, Lisboa, Portugal.

Publication Information

JMIR Cancer. 2025 Apr 16;11:e63677. doi: 10.2196/63677.

Abstract

BACKGROUND

Patients frequently resort to the internet to access information about cancer. However, these websites often lack content accuracy and readability. Recently, ChatGPT, an artificial intelligence-powered chatbot, has signified a potential paradigm shift in how patients with cancer can access vast amounts of medical information, including insights into radiotherapy. However, the quality of the information provided by ChatGPT remains unclear. This is particularly significant given the general public's limited knowledge of this treatment and concerns about its possible side effects. Furthermore, evaluating the quality of responses is crucial, as misinformation can foster a false sense of knowledge and security, lead to noncompliance, and result in delays in receiving appropriate treatment.

OBJECTIVE

This study aims to evaluate the quality and reliability of ChatGPT's responses to common patient queries about radiotherapy, comparing the performance of ChatGPT's two versions: GPT-3.5 and GPT-4.

METHODS

We selected 40 commonly asked radiotherapy questions and entered the queries in both versions of ChatGPT. Response quality and reliability were evaluated by 16 radiotherapy experts using the General Quality Score (GQS), a 5-point Likert scale, with the median GQS determined based on the experts' ratings. Consistency and similarity of responses were assessed using the cosine similarity score, which ranges from 0 (complete dissimilarity) to 1 (complete similarity). Readability was analyzed using the Flesch Reading Ease Score, ranging from 0 to 100, and the Flesch-Kincaid Grade Level, reflecting the average number of years of education required for comprehension. Statistical analyses were performed using the Mann-Whitney test and effect size, with results deemed significant at a 5% level (P=.05). To assess agreement between experts, Krippendorff α and Fleiss κ were used.
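The abstract does not name the software used for these metrics; the following Python sketch illustrates one plausible way to compute the cosine similarity and readability scores described above, assuming TF-IDF vectors for the similarity calculation and the standard Flesch formulas as implemented in the textstat package. The two response strings are placeholders, not study data.

```python
# Illustrative sketch of the similarity and readability metrics from the Methods.
# Assumption: TF-IDF vectors for cosine similarity; the abstract does not state
# the authors' exact text representation, so this is not the study's pipeline.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import textstat

# Hypothetical responses to the same patient query from the two model versions.
resp_gpt35 = "Radiotherapy uses high-energy radiation to damage cancer cells..."
resp_gpt4 = "Radiotherapy is a treatment that uses ionizing radiation to destroy cancer cells..."

# Cosine similarity between the two responses:
# 0 = complete dissimilarity, 1 = complete similarity.
tfidf = TfidfVectorizer().fit_transform([resp_gpt35, resp_gpt4])
similarity = cosine_similarity(tfidf[0], tfidf[1])[0, 0]

# Readability: Flesch Reading Ease (0-100, higher = easier to read) and
# Flesch-Kincaid Grade Level (approximate years of education required).
fre_35 = textstat.flesch_reading_ease(resp_gpt35)
fkgl_35 = textstat.flesch_kincaid_grade(resp_gpt35)
fre_4 = textstat.flesch_reading_ease(resp_gpt4)
fkgl_4 = textstat.flesch_kincaid_grade(resp_gpt4)

print(f"Cosine similarity: {similarity:.2f}")
print(f"GPT-3.5: FRE={fre_35:.1f}, FKGL={fkgl_35:.1f}")
print(f"GPT-4:   FRE={fre_4:.1f}, FKGL={fkgl_4:.1f}")
```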

RESULTS

GPT-4 demonstrated superior performance, with a higher GQS and a lower number of scores of 1 and 2, compared to GPT-3.5. The Mann-Whitney test revealed statistically significant differences in some questions, with GPT-4 generally receiving higher ratings. The median (IQR) cosine similarity score indicated substantial similarity (0.81, IQR 0.05) and consistency in the responses of both versions (GPT-3.5: 0.85, IQR 0.04; GPT-4: 0.83, IQR 0.04). Readability scores for both versions were considered college level, with GPT-4 scoring slightly better in the Flesch Reading Ease Score (34.61) and Flesch-Kincaid Grade Level (12.32) compared to GPT-3.5 (32.98 and 13.32, respectively). Responses by both versions were deemed challenging for the general public.
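The abstract reports only summary outcomes of the per-question comparison; the sketch below shows how a Mann-Whitney test with a rank-biserial effect size could be applied to the 16 experts' GQS ratings for a single question, following the Methods. The rating vectors are hypothetical and are not the study's data.

```python
# Illustrative per-question comparison of expert GQS ratings (5-point Likert)
# between GPT-3.5 and GPT-4. The ratings below are hypothetical.
from scipy.stats import mannwhitneyu

gqs_gpt35 = [3, 3, 4, 2, 3, 4, 3, 3, 2, 4, 3, 3, 4, 3, 2, 3]  # 16 expert ratings
gqs_gpt4  = [4, 4, 5, 4, 4, 5, 4, 3, 4, 5, 4, 4, 5, 4, 4, 4]

u_stat, p_value = mannwhitneyu(gqs_gpt35, gqs_gpt4, alternative="two-sided")

# Rank-biserial correlation as an effect size: r = 1 - 2U / (n1 * n2).
n1, n2 = len(gqs_gpt35), len(gqs_gpt4)
rank_biserial = 1 - (2 * u_stat) / (n1 * n2)

# Differences are judged significant at the 5% level, as stated in the Methods.
print(f"U={u_stat:.1f}, P={p_value:.4f}, rank-biserial r={rank_biserial:.2f}")
```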

CONCLUSIONS

Both GPT-3.5 and GPT-4 demonstrated the capability to address radiotherapy concepts, with GPT-4 showing superior performance. However, both models present readability challenges for the general population. Although ChatGPT demonstrates potential as a valuable resource for addressing common patient queries related to radiotherapy, it is imperative to acknowledge its limitations, including the risks of misinformation and readability issues. In addition, its implementation should be supported by strategies to enhance accessibility and readability.

Figure: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e565/12017613/3fc23efbed85/cancer-v11-e63677-g001.jpg
