Kayra Mehmet Vehbi, Anil Hakan, Ozdogan Ilturk, Baradia Suhail Mohamed Amin, Toksoz Serdar
Department of Urology, Baskent University Adana Dr. Turgut Noyan Application and Research Center, Adana, Turkey.
Department of Urology, Adana City Hospital, Adana, Turkey.
Int J Impot Res. 2025 Jun 3. doi: 10.1038/s41443-025-01098-3.
This study aimed to evaluate and compare the performance of artificial intelligence chatbots by assessing the reliability and quality of the information they provide regarding penis enhancement (PE). Search trends for PE-related keywords were identified using Google Trends (https://trends.google.com) and Semrush (https://www.semrush.com). Data covering a ten-year period were analyzed, taking regional trends and changes in search volume into account. Based on these trends, 25 questions were selected and categorized into three groups: general information (GI), surgical treatment (ST), and myths/misconceptions (MM). These questions were posed to three advanced chatbots: ChatGPT-4, Gemini Pro, and Llama 3.1. Responses from each model were analyzed for readability using the Flesch-Kincaid Grade Level (FKGL) and the Flesch Reading Ease Score (FRES), while response quality was evaluated using the Ensuring Quality Information for Patients (EQIP) tool and the Modified DISCERN score. By FKGL and FRES, all chatbot responses were difficult to read and understand, with no statistically significant differences among the models (FKGL: p = 0.167; FRES: p = 0.366). Llama achieved the highest median Modified DISCERN score (4 [IQR: 1]), significantly outperforming ChatGPT (3 [IQR: 0]) and Gemini (3 [IQR: 2]) (p < 0.001). Pairwise comparisons showed no significant difference between ChatGPT and Gemini (p = 0.070), but Llama was superior to both (p < 0.001). In EQIP scores, Llama also scored highest (73.8 ± 2.2), significantly surpassing ChatGPT (68.7 ± 2.1) and Gemini (54.2 ± 1.3) (p < 0.001).
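The readability metrics named above are closed-form formulas over word, sentence, and syllable counts: FKGL = 0.39·(words/sentence) + 11.8·(syllables/word) − 15.59, and FRES = 206.835 − 1.015·(words/sentence) − 84.6·(syllables/word). A minimal Python sketch is below; it uses a naive vowel-group syllable heuristic, whereas published readability tools typically rely on validated counters or pronunciation dictionaries:

```python
import re

def count_syllables(word: str) -> int:
    """Naive heuristic: count vowel groups, trimming most silent final e's.
    Real analyses use dictionaries (e.g. CMUdict) for accurate counts."""
    word = word.lower()
    n = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and n > 1 and not word.endswith(("le", "ee")):
        n -= 1
    return max(n, 1)

def readability(text: str) -> tuple:
    """Return (FKGL, FRES) using the standard Flesch formulas."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    wps = len(words) / len(sentences)   # average words per sentence
    spw = syllables / len(words)        # average syllables per word
    fkgl = 0.39 * wps + 11.8 * spw - 15.59
    fres = 206.835 - 1.015 * wps - 84.6 * spw
    return round(fkgl, 1), round(fres, 1)
```

Lower FKGL and higher FRES both indicate easier text; patient-information guidance often targets roughly a sixth-grade level (FKGL ≈ 6, FRES ≈ 70+), which is the benchmark the chatbot responses failed to meet.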
Across categories, Llama consistently achieved higher EQIP scores (GI: 71.1 ± 1.6; ST: 73.6 ± 4.1; MM: 76.3 ± 2.1) and Modified DISCERN scores (GI: 4 [IQR: 0]; ST: 4 [IQR: 1]; MM: 3 [IQR: 1]) than ChatGPT (EQIP: GI: 68.4 ± 1.1; ST: 65.7 ± 2.2; MM: 71.1 ± 1.7; Modified DISCERN: GI: 3 [IQR: 1]; ST: 3 [IQR: 1]; MM: 3 [IQR: 0]) and Gemini (EQIP: GI: 55.2 ± 1.4; ST: 55.2 ± 1.6; MM: 2.6 ± 2.5; Modified DISCERN: GI: 1 [IQR: 2]; ST: 1 [IQR: 2]; MM: 3 [IQR: 0]) (p < 0.001). This study highlights Llama's superior reliability in providing PE-related health information, though all chatbots struggled with readability.
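The abstract reports omnibus and pairwise p-values but does not name the tests used. For ordinal scores such as Modified DISCERN, a nonparametric pairwise test like Mann-Whitney U is a common choice; the stdlib-only sketch below (normal approximation with tie correction) is illustrative only and is not the study's actual analysis:

```python
import math

def _midranks(values):
    """1-based average ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def mann_whitney_u(x, y):
    """Two-sided Mann-Whitney U test via the normal approximation with a
    tie correction; adequate for moderate samples (roughly n >= 8 per group)."""
    n1, n2 = len(x), len(y)
    n = n1 + n2
    pooled = list(x) + list(y)
    ranks = _midranks(pooled)
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    # Variance of U, reduced when ties are present
    counts = {}
    for v in pooled:
        counts[v] = counts.get(v, 0) + 1
    tie_term = sum(t ** 3 - t for t in counts.values())
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    z = (u1 - mean_u) / math.sqrt(var_u)
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided p from |z|
    return u1, p
```

With three models, an omnibus test (e.g. Kruskal-Wallis) would normally precede the pairwise comparisons, and a multiple-comparison correction such as Bonferroni (alpha divided by the three pairs) would apply; neither is specified in the abstract.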