Aydinbelge-Dizdar N, Dizdar K
Department of Nuclear Medicine, Ankara Etlik City Hospital, Ankara, Turkiye.
Department of Software Engineering, ASELSAN Inc., Ankara, Turkiye.
Rev Esp Med Nucl Imagen Mol (Engl Ed). 2025 Jan-Feb;44(1):500065. doi: 10.1016/j.remnie.2024.500065. Epub 2024 Sep 28.
This study aimed to evaluate the reliability and readability of responses generated by two popular AI chatbots, ChatGPT-4.0 and Google Gemini, to potential patient questions about PET/CT scans.
Thirty potential questions each for [18F]FDG and [68Ga]Ga-DOTA-SSTR PET/CT, and twenty-nine potential questions for [68Ga]Ga-PSMA PET/CT, were posed separately to ChatGPT-4 and Gemini in May 2024. The responses were evaluated for reliability and readability using the modified DISCERN (mDISCERN) scale, the Flesch Reading Ease (FRE) score, the Gunning Fog Index (GFI), and the Flesch-Kincaid Reading Grade Level (FKRGL). The inter-rater reliability of the mDISCERN scores assigned to the responses by three raters (ChatGPT-4, Gemini, and a nuclear medicine physician) was assessed.
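The three readability indices named above follow standard published formulas. The minimal Python sketch below (not from the study) shows the conventional computations; the syllable counter is a crude vowel-group heuristic, so its scores only approximate those of validated readability tools.

```python
# Minimal sketch of the standard FRE, FKRGL, and GFI formulas,
# using a crude vowel-group syllable heuristic (approximate only).
import re

def count_syllables(word: str) -> int:
    # Each run of consecutive vowels counts as one syllable;
    # a trailing silent 'e' is subtracted. Always at least 1.
    word = word.lower()
    n = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and n > 1:
        n -= 1
    return max(n, 1)

def readability_scores(text: str) -> dict:
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    # "Complex" words (3+ syllables) feed the Gunning Fog Index.
    complex_words = sum(1 for w in words if count_syllables(w) >= 3)
    w, s = max(len(words), 1), max(len(sentences), 1)
    return {
        # Flesch Reading Ease: higher = easier (60-70 ~ plain English).
        "FRE": 206.835 - 1.015 * (w / s) - 84.6 * (syllables / w),
        # Flesch-Kincaid Reading Grade Level: US school grade.
        "FKRGL": 0.39 * (w / s) + 11.8 * (syllables / w) - 15.59,
        # Gunning Fog Index: years of education needed to understand.
        "GFI": 0.4 * ((w / s) + 100 * (complex_words / w)),
    }

print(readability_scores(
    "A PET/CT scan combines two imaging techniques. "
    "You may be asked to fast before the examination."
))
```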
The median [min-max] mDISCERN scores assigned by the physician to responses about FDG, PSMA, and DOTA-SSTR PET/CT scans were 3.5 [2-4], 3 [3-4], and 3 [3-4] for ChatGPT-4, and 4 [2-5], 4 [2-5], and 3.5 [3-5] for Gemini, respectively. With ChatGPT-4 as rater, the mDISCERN scores for responses about FDG, PSMA, and DOTA-SSTR PET/CT scans were 3.5 [3-5], 3 [3-4], and 3 [2-3] for ChatGPT-4, and 4 [3-5], 4 [3-5], and 4 [3-5] for Gemini, respectively. With Gemini as rater, the scores for responses about FDG, PSMA, and DOTA-SSTR PET/CT scans were 3 [2-4], 2 [2-4], and 3 [2-4] for ChatGPT-4, and 3 [2-5], 3 [1-5], and 3 [2-5] for Gemini, respectively. The inter-rater reliability correlation coefficients of the mDISCERN scores for ChatGPT-4 responses about FDG, PSMA, and DOTA-SSTR PET/CT scans were 0.629 (95% CI = 0.320-0.812), 0.707 (95% CI = 0.458-0.853), and 0.738 (95% CI = 0.519-0.866), respectively (p < 0.001). The corresponding coefficients for Gemini responses were 0.824 (95% CI = 0.677-0.910), 0.881 (95% CI = 0.780-0.940), and 0.847 (95% CI = 0.719-0.922), respectively (p < 0.001). Across the three raters (ChatGPT-4, Gemini, and the physician), the mDISCERN scores showed moderate to good statistical agreement for the chatbots' responses about all PET/CT scans, according to the inter-rater reliability correlation coefficients (p < 0.001). All readability scores (FKRGL, GFI, and FRE) differed significantly between ChatGPT-4 and Gemini responses about PET/CT scans (p < 0.001); Gemini responses were shorter and had better readability scores than ChatGPT-4 responses.
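The abstract does not state which intraclass correlation model was used for inter-rater reliability. As an illustration only, the sketch below assumes a two-way random-effects, single-rater model (Shrout-Fleiss ICC(2,1)) and uses hypothetical mDISCERN scores; it is not the study's analysis.

```python
# Minimal ICC(2,1) sketch (two-way random effects, single rater);
# the study's actual ICC model and data are not given in the abstract.
import numpy as np

def icc_2_1(ratings: np.ndarray) -> float:
    n, k = ratings.shape                       # n responses, k raters
    grand = ratings.mean()
    row_means = ratings.mean(axis=1)
    col_means = ratings.mean(axis=0)
    msr = k * ((row_means - grand) ** 2).sum() / (n - 1)  # between responses
    msc = n * ((col_means - grand) ** 2).sum() / (k - 1)  # between raters
    sse = ((ratings - grand) ** 2).sum() \
          - (n - 1) * msr - (k - 1) * msc
    mse = sse / ((n - 1) * (k - 1))                       # residual
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)

# Hypothetical mDISCERN scores (1-5) for five responses from three raters:
# columns = ChatGPT-4, Gemini, physician.
scores = np.array([
    [3, 4, 3],
    [4, 4, 4],
    [3, 3, 2],
    [5, 4, 4],
    [2, 3, 3],
])
print(round(icc_2_1(scores), 3))
```

Under the commonly cited Koo and Li thresholds (0.5-0.75 moderate, 0.75-0.9 good), the reported coefficients of 0.629-0.881 fall in the moderate-to-good range, consistent with the abstract's characterization.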
There was an acceptable level of agreement between raters on the mDISCERN scores, indicating acceptable overall reliability of the responses. However, the information provided by the AI chatbots cannot be easily read by the general public.