Department of Woman, Child and of General and Specialized Surgery, Università Degli Studi Della Campania "Luigi Vanvitelli", Via Luigi De Crecchio 2, 80138, Naples, Italy.
Pediatr Nephrol. 2025 Jan;40(1):151-157. doi: 10.1007/s00467-024-06486-3. Epub 2024 Aug 16.
We aimed to evaluate the baseline performance and improvement of ChatGPT-4 "omni" (ChatGPT-4o) and Gemini 1.5 Flash (Gemini 1.5) in answering multiple-choice questions related to pediatric nephrology after specific training.
Using the multiple-choice questions from the "Educational Review" articles published in Pediatric Nephrology between January 2014 and April 2024, the models were tested both before and after specific training with Portable Document Format (PDF) and plain-text (TXT) versions of the articles; a Python script removed the last page of each article, which contained the correct answers. The number of correct answers was recorded.
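The authors do not publish their preprocessing script. As a minimal sketch of the answer-removal step for the TXT versions, assuming pages in the extracted text are separated by form-feed characters (as common PDF-to-text extractors emit), the final answer page could be dropped like this:

```python
def strip_answer_page(text: str) -> str:
    """Drop the final page (the one listing the correct answers) from an
    article whose pages are separated by form-feed (\f) characters.

    This is a hypothetical reconstruction, not the authors' actual script.
    """
    pages = text.split("\f")
    if len(pages) < 2:  # nothing to strip if the text is a single page
        return text
    return "\f".join(pages[:-1])


# Hypothetical three-page article; the last page holds the answer key.
article = "Intro page\fQuestions page\fAnswers: 1-b 2-c"
print(strip_answer_page(article))  # prints the first two pages only
```

For the PDF versions, the same idea would apply at the page-object level (e.g., copying all pages except the last with a PDF library) rather than on raw text.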
Before training, ChatGPT-4o correctly answered 75.2% of the 1395 questions, outperforming Gemini 1.5, which answered 64.9% correctly (p < 0.001). After training with PDF files, ChatGPT-4o's accuracy increased to 77.8%, while Gemini 1.5 improved significantly to 84.7% (p < 0.001). Training with TXT files showed similar results, with ChatGPT-4o maintaining 77.8% accuracy and Gemini 1.5 further improving to 87.6% (p < 0.001).
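The reported p-values are consistent with a standard two-proportion comparison on the 1395 questions. As a quick check (assuming each model answered the same 1395 questions, and using a pooled two-proportion z-test, which the abstract does not name), the baseline comparison can be verified with the standard library alone:

```python
import math


def two_proportion_p(x1: int, n1: int, x2: int, n2: int) -> float:
    """Two-sided p-value for H0: p1 == p2, using a pooled
    two-proportion z-test (normal approximation)."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided tail probability of the standard normal.
    return math.erfc(abs(z) / math.sqrt(2))


n = 1395
# Baseline accuracy: ChatGPT-4o 75.2% vs. Gemini 1.5 64.9% correct.
p = two_proportion_p(round(0.752 * n), n, round(0.649 * n), n)
print(p < 0.001)  # True: consistent with the reported p < 0.001
```

The same calculation applied to the post-training percentages likewise yields p-values well below 0.001, matching the abstract's figures.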
The study highlights that while ChatGPT-4o has strong baseline performance, specific training does not significantly enhance its accuracy. Conversely, Gemini 1.5, despite its lower initial performance, improves substantially with training, particularly with TXT files. These findings suggest that Gemini 1.5 is better able to store and retrieve supplied information, making it potentially more effective in clinical applications, although its best performance depends on being given additional data.