Hacibey Ibrahim, Kaba Esat
Department of Urology, Basaksehir Çam and Sakura City Hospital, Istanbul, Turkey.
Department of Radiology, Recep Tayyip Erdogan University, Rize, Turkey.
Radiologie (Heidelb). 2025 Aug 24. doi: 10.1007/s00117-025-01499-x.
The Bosniak classification system is widely used to assess malignancy risk in renal cystic lesions, yet inter-observer variability poses significant challenges. Large language models (LLMs) may offer a standardized approach to classification when provided with textual descriptions, such as those found in radiology reports.
This study evaluated the performance of five LLMs (GPT‑4 [ChatGPT], Gemini, Copilot, Perplexity, and NotebookLM) in classifying renal cysts based on synthetic textual descriptions mimicking CT report content.
A synthetic dataset of 100 diagnostic scenarios (20 cases per Bosniak category) was constructed using established radiological criteria. GPT‑4, Gemini, Copilot, and Perplexity were each evaluated using zero-shot and few-shot prompting strategies, while NotebookLM was evaluated only with retrieval-augmented generation (RAG). Performance metrics included accuracy, sensitivity, and specificity. Statistical significance was assessed using McNemar's and chi-squared tests.
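As an illustration of the metrics named above, the following minimal Python sketch computes overall accuracy, one-vs-rest sensitivity and specificity for a single Bosniak category, and an exact two-sided McNemar test comparing two prompting strategies on the same cases. The abstract does not state how the metrics were computed per category or which variant of McNemar's test was used, so the one-vs-rest scheme and the exact binomial form are assumptions. The function names and the ten toy labels are hypothetical and are not study data.

```python
from math import comb

# Toy ground truth and predictions (hypothetical, NOT data from the study).
truth     = ["I", "II", "IIF", "III", "IV", "I", "II", "IIF", "III", "IV"]
zero_shot = ["I", "II", "II",  "III", "IV", "I", "II", "III", "III", "III"]
few_shot  = ["I", "II", "IIF", "III", "IV", "I", "II", "II",  "III", "IV"]


def accuracy(truth, pred):
    """Fraction of cases whose predicted category matches the reference."""
    return sum(t == p for t, p in zip(truth, pred)) / len(truth)


def sensitivity_specificity(truth, pred, category):
    """One-vs-rest sensitivity and specificity for one category (assumed scheme)."""
    pairs = list(zip(truth, pred))
    tp = sum(t == category and p == category for t, p in pairs)
    fn = sum(t == category and p != category for t, p in pairs)
    tn = sum(t != category and p != category for t, p in pairs)
    fp = sum(t != category and p == category for t, p in pairs)
    return tp / (tp + fn), tn / (tn + fp)


def mcnemar_exact(truth, pred_a, pred_b):
    """Exact two-sided McNemar p-value from the discordant pairs only."""
    # b: A correct and B wrong; c: A wrong and B correct.
    b = sum(a == t and x != t for t, a, x in zip(truth, pred_a, pred_b))
    c = sum(a != t and x == t for t, a, x in zip(truth, pred_a, pred_b))
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, no evidence of a difference
    k = min(b, c)
    # Two-sided binomial tail with success probability 0.5, capped at 1.
    return min(1.0, 2 * sum(comb(n, i) for i in range(k + 1)) / 2 ** n)


if __name__ == "__main__":
    print("zero-shot accuracy:", accuracy(truth, zero_shot))
    print("few-shot accuracy:", accuracy(truth, few_shot))
    print("IIF sens/spec (zero-shot):",
          sensitivity_specificity(truth, zero_shot, "IIF"))
    print("McNemar p-value:", mcnemar_exact(truth, zero_shot, few_shot))
```

McNemar's test fits this design because every prompting strategy is scored on the same set of scenarios, so only cases where the two strategies disagree in correctness carry information about the difference.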
GPT‑4 achieved the highest accuracy (87% zero-shot, 99% few-shot), followed by Copilot (81-86%), Gemini (55-69%), and Perplexity (43-69%). NotebookLM, tested only under RAG conditions, reached 87% accuracy. Few-shot learning significantly improved performance (p < 0.05). Classification of Bosniak IIF lesions remained challenging across models.
When provided with well-structured textual descriptions, LLMs can accurately classify renal cysts. Few-shot prompting significantly enhances performance. However, persistent difficulties in classifying borderline lesions such as Bosniak IIF highlight the need for further refinement and real-world validation.