Tsai Chao-Wei, Lin Yi-Jing, Hou Jing-Uei, Tsai Shih-Chuan, Yeh Pei-Chun, Kao Chia-Hung
Department of Nuclear Medicine, Taichung Veterans General Hospital, Taichung, Taiwan.
Artificial Intelligence Center, China Medical University Hospital, Taichung, Taiwan.
Digit Health. 2025 Jul 7;11:20552076251357468. doi: 10.1177/20552076251357468. eCollection 2025 Jan-Dec.
ChatGPT has the potential to enhance patient education by offering clear and accurate responses, but its reliability in providing precise medical information is still under investigation. This study evaluates its effectiveness in assisting healthcare professionals with patient inquiries about radioiodine therapy.
This study used OpenAI's GPT-4o and GPT-4 models, with each query submitted as a separate prompt. Chain-of-thought prompting was used to require each model to articulate its step-by-step reasoning before giving the final answer, making the decision process transparent for qualitative evaluation. Three responses were generated per prompt and evaluated by three nuclear medicine physicians using a 4-point Likert scale across five aspects: Appropriateness, Helpfulness, Consistency, Validity of References, and Empathy. Normality tests, the Wilcoxon signed-rank test, and chi-square tests were used for analysis.
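The two comparison tests named above can be sketched with scipy; the paired Likert ratings and empathy counts below are invented for illustration, not the study's data.

```python
# Sketch of the statistical comparison described above, on mock data.
from scipy.stats import wilcoxon, chi2_contingency

# Hypothetical paired 4-point Likert ratings for one dimension
# (e.g. appropriateness), one pair per rated response.
gpt4_scores  = [4, 3, 3, 4, 2, 3, 4, 3, 3, 2, 4, 3]
gpt4o_scores = [3, 3, 4, 4, 2, 2, 4, 3, 4, 2, 3, 3]

# Wilcoxon signed-rank test compares paired ratings without assuming
# normality, which suits ordinal Likert data.
stat, p_value = wilcoxon(gpt4_scores, gpt4o_scores)
print(f"Wilcoxon W = {stat:.1f}, p = {p_value:.3f}")

# Chi-square test on a 2x2 contingency table of empathy judgements
# (present vs absent for each model); counts are illustrative only.
table = [[71, 55],   # GPT-4:  empathy present, absent
         [84, 42]]   # GPT-4o: empathy present, absent
chi2, p_chi, dof, _ = chi2_contingency(table)
print(f"chi-square = {chi2:.2f}, df = {dof}, p = {p_chi:.3f}")
```

With real study data, one such paired test would be run per evaluation dimension.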
A total of 126 paired responses from GPT-4 and GPT-4o were independently rated by three nuclear medicine physicians. Both models performed similarly across the main dimensions (appropriateness, helpfulness, consistency, and validity of references), with no statistically significant differences (Wilcoxon signed-rank test, P ≥ 0.01). High-level ratings (score ≥ 3) were achieved in appropriateness for 90.4% of GPT-4 outputs and 84.9% of GPT-4o outputs, and in helpfulness for 92.1% of outputs from both models. Citation accuracy was limited: fully valid references were present in 20.6% of GPT-4 and 21.4% of GPT-4o responses. Empathy was judged present in 56.3% of GPT-4 and 66.7% of GPT-4o answers (χ² test, P > 0.05). Inter-rater agreement was low (Fleiss κ = 0.04).
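The Fleiss κ statistic reported above measures agreement among the three raters beyond chance. A minimal pure-Python computation, on an invented ratings matrix (each row is one rated response, each column a Likert score, entries count how many of the 3 raters gave that score):

```python
def fleiss_kappa(counts):
    """counts[i][j] = number of raters assigning subject i to category j."""
    n_subjects = len(counts)
    n_raters = sum(counts[0])  # raters per subject (assumed constant)
    n_categories = len(counts[0])
    # Proportion of all assignments falling in each category.
    totals = [sum(row[j] for row in counts) for j in range(n_categories)]
    p_cat = [t / (n_subjects * n_raters) for t in totals]
    # Observed agreement for each subject.
    p_subj = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_subj) / n_subjects    # mean observed agreement
    p_exp = sum(p * p for p in p_cat)   # agreement expected by chance
    return (p_bar - p_exp) / (1 - p_exp)

# Hypothetical matrix: 6 responses, 3 raters, 4-point Likert scale.
ratings = [
    [0, 1, 1, 1],
    [0, 0, 2, 1],
    [1, 1, 1, 0],
    [0, 2, 1, 0],
    [0, 1, 0, 2],
    [1, 0, 1, 1],
]
print(f"Fleiss kappa = {fleiss_kappa(ratings):.3f}")
```

Values near 0 (or below) indicate agreement no better than chance, consistent with the low κ = 0.04 reported in the results.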
The results suggest that ChatGPT can furnish generally appropriate and helpful answers to frequently asked questions in radioactive iodine treatment, yet citation accuracy remains limited, underscoring the need for clinician oversight. GPT-4o and GPT-4 demonstrated comparable performance, indicating that model selection within this family has minimal impact under the controlled conditions studied.