Belge Bilgin Gokce, Bilgin Cem, Childs Daniel S, Orme Jacob J, Burkett Brian J, Packard Ann T, Johnson Derek R, Thorpe Matthew P, Riaz Irbaz Bin, Halfdanarson Thorvardur R, Johnson Geoffrey B, Sartor Oliver, Kendi Ayse Tuba
Department of Radiology, Mayo Clinic, Rochester, MN, United States.
Division of Medical Oncology, Department of Oncology, Mayo Clinic, Rochester, MN, United States.
Front Oncol. 2024 Jul 12;14:1386718. doi: 10.3389/fonc.2024.1386718. eCollection 2024.
Many patients use artificial intelligence (AI) chatbots as a rapid source of health information. This raises important questions about the reliability and effectiveness of AI chatbots in delivering accurate and understandable information.
To evaluate and compare the accuracy, conciseness, and readability of responses from OpenAI ChatGPT-4 and Google Bard to patient inquiries concerning the novel Lu-PSMA-617 therapy for prostate cancer.
Two experts listed the 12 questions most commonly asked by patients about Lu-PSMA-617 therapy. These 12 questions were submitted as prompts to OpenAI ChatGPT-4 and Google Bard. The AI-generated responses were distributed through an online survey platform (Qualtrics) and rated in blinded fashion by eight experts. The chatbots' performance was evaluated and compared across three domains: accuracy, conciseness, and readability. Potential safety concerns associated with the AI-generated answers were also examined. Mann-Whitney U and chi-square tests were used to compare the performance of the two chatbots.
Eight experts participated in the survey, evaluating 12 AI-generated responses across the three domains of accuracy, conciseness, and readability, yielding 96 assessments (12 responses × 8 experts) per domain per chatbot. ChatGPT-4 provided more accurate answers than Bard (2.95 ± 0.671 vs 2.73 ± 0.732, P = 0.027). Bard's responses had better readability than ChatGPT-4's (2.79 ± 0.408 vs 2.94 ± 0.243, P = 0.003). ChatGPT-4 and Bard achieved comparable conciseness scores (3.14 ± 0.659 vs 3.11 ± 0.679, P = 0.798). Experts categorized the AI-generated responses as incorrect or partially correct at a rate of 16.6% for ChatGPT-4 and 29.1% for Bard. Bard's answers contained significantly more misleading information than those of ChatGPT-4 (P = 0.039).
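As a minimal sketch of the type of analysis described above, the Python snippet below applies a Mann-Whitney U test to two sets of ordinal expert ratings and a chi-square test to counts of answers flagged as misleading. The rating lists and contingency counts are illustrative placeholders, not the study's data, and the variable names are assumptions for this example.

    # Minimal sketch of the statistical comparisons, assuming each chatbot's
    # ratings are stored as flat lists (one value per response-expert pair).
    # All numbers below are placeholders for illustration only.
    from scipy.stats import mannwhitneyu, chi2_contingency

    # Ordinal accuracy ratings (illustrative values, not the study's data)
    chatgpt4_accuracy = [3, 3, 4, 2, 3, 3, 4, 3, 2, 3, 3, 4]
    bard_accuracy     = [3, 2, 3, 2, 3, 3, 2, 3, 2, 4, 3, 2]

    # Ordinal ratings from two independent groups -> Mann-Whitney U test
    u_stat, p_value = mannwhitneyu(chatgpt4_accuracy, bard_accuracy,
                                   alternative="two-sided")
    print(f"Mann-Whitney U = {u_stat:.1f}, p = {p_value:.3f}")

    # Counts of answers flagged as containing misleading information
    # (rows: chatbot; columns: [misleading, not misleading]) -> chi-square test
    contingency = [[4, 92],   # ChatGPT-4 (placeholder counts)
                   [12, 84]]  # Bard (placeholder counts)
    chi2, p_chi, dof, expected = chi2_contingency(contingency)
    print(f"Chi-square = {chi2:.2f}, p = {p_chi:.3f}")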
AI chatbots have gained significant attention, and their performance is continuously improving. Nonetheless, these technologies still need further improvements to be considered reliable and credible sources for patients seeking medical information on Lu-PSMA-617 therapy.