Aydinbelge-Dizdar N, Dizdar K
Department of Nuclear Medicine, Ankara Etlik City Hospital, Ankara, Turkiye.
Department of Software Engineering, ASELSAN Inc., Ankara, Turkiye.
Rev Esp Med Nucl Imagen Mol (Engl Ed). 2025 Jan-Feb;44(1):500065. doi: 10.1016/j.remnie.2024.500065. Epub 2024 Sep 28.
This study aimed to evaluate the reliability and readability of responses generated by two popular AI chatbots, ChatGPT-4.0 and Google Gemini, to potential patient questions about PET/CT scans.
Thirty potential questions each for [18F]FDG and [68Ga]Ga-DOTA-SSTR PET/CT, and twenty-nine potential questions for [68Ga]Ga-PSMA PET/CT, were posed separately to ChatGPT-4 and Gemini in May 2024. The responses were evaluated for reliability and readability using the modified DISCERN (mDISCERN) scale, the Flesch Reading Ease (FRE) score, the Gunning Fog Index (GFI), and the Flesch-Kincaid Reading Grade Level (FKRGL). The inter-rater reliability of the mDISCERN scores assigned to the responses by three raters (ChatGPT-4, Gemini, and a nuclear medicine physician) was assessed.
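The three readability indices named above follow standard published formulas. The minimal Python sketch below (not from the study) shows the conventional computations; the syllable counter is a crude vowel-group heuristic, so its scores only approximate those of validated readability tools.

```python
# Minimal sketch of the standard FRE, FKRGL, and GFI formulas,
# using a crude vowel-group syllable heuristic (approximate only).
import re

def count_syllables(word: str) -> int:
    # Each run of consecutive vowels counts as one syllable;
    # a trailing silent 'e' is subtracted. Always at least 1.
    word = word.lower()
    n = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and n > 1:
        n -= 1
    return max(n, 1)

def readability_scores(text: str) -> dict:
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    # "Complex" words (3+ syllables) feed the Gunning Fog Index.
    complex_words = sum(1 for w in words if count_syllables(w) >= 3)
    w, s = max(len(words), 1), max(len(sentences), 1)
    return {
        # Flesch Reading Ease: higher = easier (60-70 ~ plain English).
        "FRE": 206.835 - 1.015 * (w / s) - 84.6 * (syllables / w),
        # Flesch-Kincaid Reading Grade Level: US school grade.
        "FKRGL": 0.39 * (w / s) + 11.8 * (syllables / w) - 15.59,
        # Gunning Fog Index: years of education needed to understand.
        "GFI": 0.4 * ((w / s) + 100 * (complex_words / w)),
    }

print(readability_scores(
    "A PET/CT scan combines two imaging techniques. "
    "You may be asked to fast before the examination."
))
```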
The median [min-max] mDISCERN scores assigned by the physician to responses about FDG, PSMA, and DOTA-SSTR PET/CT scans were 3.5 [2-4], 3 [3-4], and 3 [3-4] for ChatGPT-4, and 4 [2-5], 4 [2-5], and 3.5 [3-5] for Gemini, respectively. With ChatGPT-4 as rater, the mDISCERN scores for responses about FDG, PSMA, and DOTA-SSTR PET/CT scans were 3.5 [3-5], 3 [3-4], and 3 [2-3] for ChatGPT-4, and 4 [3-5], 4 [3-5], and 4 [3-5] for Gemini, respectively. With Gemini as rater, the scores for responses about FDG, PSMA, and DOTA-SSTR PET/CT scans were 3 [2-4], 2 [2-4], and 3 [2-4] for ChatGPT-4, and 3 [2-5], 3 [1-5], and 3 [2-5] for Gemini, respectively. The inter-rater reliability correlation coefficients of the mDISCERN scores for ChatGPT-4 responses about FDG, PSMA, and DOTA-SSTR PET/CT scans were 0.629 (95% CI = 0.320-0.812), 0.707 (95% CI = 0.458-0.853), and 0.738 (95% CI = 0.519-0.866), respectively (p < 0.001). The corresponding coefficients for Gemini responses were 0.824 (95% CI = 0.677-0.910), 0.881 (95% CI = 0.780-0.940), and 0.847 (95% CI = 0.719-0.922), respectively (p < 0.001). Across the three raters (ChatGPT-4, Gemini, and the physician), the mDISCERN scores showed moderate to good statistical agreement for the chatbots' responses about all PET/CT scans, according to the inter-rater reliability correlation coefficients (p < 0.001). All readability scores (FKRGL, GFI, and FRE) differed significantly between ChatGPT-4 and Gemini responses about PET/CT scans (p < 0.001); Gemini responses were shorter and had better readability scores than ChatGPT-4 responses.
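The abstract does not state which intraclass correlation model was used for inter-rater reliability. As an illustration only, the sketch below assumes a two-way random-effects, single-rater model (Shrout-Fleiss ICC(2,1)) and uses hypothetical mDISCERN scores; it is not the study's analysis.

```python
# Minimal ICC(2,1) sketch (two-way random effects, single rater);
# the study's actual ICC model and data are not given in the abstract.
import numpy as np

def icc_2_1(ratings: np.ndarray) -> float:
    n, k = ratings.shape                       # n responses, k raters
    grand = ratings.mean()
    row_means = ratings.mean(axis=1)
    col_means = ratings.mean(axis=0)
    msr = k * ((row_means - grand) ** 2).sum() / (n - 1)  # between responses
    msc = n * ((col_means - grand) ** 2).sum() / (k - 1)  # between raters
    sse = ((ratings - grand) ** 2).sum() \
          - (n - 1) * msr - (k - 1) * msc
    mse = sse / ((n - 1) * (k - 1))                       # residual
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)

# Hypothetical mDISCERN scores (1-5) for five responses from three raters:
# columns = ChatGPT-4, Gemini, physician.
scores = np.array([
    [3, 4, 3],
    [4, 4, 4],
    [3, 3, 2],
    [5, 4, 4],
    [2, 3, 3],
])
print(round(icc_2_1(scores), 3))
```

Under the commonly cited Koo and Li thresholds (0.5-0.75 moderate, 0.75-0.9 good), the reported coefficients of 0.629-0.881 fall in the moderate-to-good range, consistent with the abstract's characterization.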
There was an acceptable level of agreement between raters on the mDISCERN scores, indicating acceptable overall reliability of the responses. However, the information provided by the AI chatbots cannot be easily read by the general public.