Bosniak classification of renal cysts using large language models: a comparative study.

Author information

Hacibey Ibrahim, Kaba Esat

Affiliations

Department of Urology, Basaksehir Çam and Sakura City Hospital, Istanbul, Turkey.

Department of Radiology, Recep Tayyip Erdogan University, Rize, Turkey.

Publication information

Radiologie (Heidelb). 2025 Aug 24. doi: 10.1007/s00117-025-01499-x.

DOI: 10.1007/s00117-025-01499-x
PMID: 40851045
Abstract

BACKGROUND

The Bosniak classification system is widely used to assess malignancy risk in renal cystic lesions, yet inter-observer variability poses significant challenges. Large language models (LLMs) may offer a standardized approach to classification when provided with textual descriptions, such as those found in radiology reports.

OBJECTIVE

This study evaluated the performance of five LLMs, namely GPT‑4 (ChatGPT), Gemini, Copilot, Perplexity, and NotebookLM, in classifying renal cysts based on synthetic textual descriptions mimicking CT report content.

METHODS

A synthetic dataset of 100 diagnostic scenarios (20 cases per Bosniak category) was constructed using established radiological criteria. GPT‑4, Gemini, Copilot, and Perplexity were each evaluated using zero-shot and few-shot prompting strategies, whereas NotebookLM was evaluated only with retrieval-augmented generation (RAG). Performance metrics included accuracy, sensitivity, and specificity. Statistical significance was assessed using McNemar's and chi-squared tests.
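The abstract does not say how sensitivity and specificity were derived for a five-category problem. A common convention, assumed here, is one-vs-rest: each Bosniak category in turn is treated as the positive class. The sketch below uses only the Python standard library; the function name `one_vs_rest_metrics` and the ten toy labels are illustrative and are not study data.

```python
BOSNIAK = ["I", "II", "IIF", "III", "IV"]

def one_vs_rest_metrics(y_true, y_pred, labels=BOSNIAK):
    """Return overall accuracy plus per-class sensitivity and specificity.

    Assumption: one-vs-rest evaluation, i.e. each label in turn is the
    positive class and all other labels together form the negative class.
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")
    pairs = list(zip(y_true, y_pred))
    n = len(pairs)
    accuracy = sum(t == p for t, p in pairs) / n
    per_class = {}
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        tn = n - tp - fn - fp
        per_class[label] = {
            "sensitivity": tp / (tp + fn) if tp + fn else float("nan"),
            "specificity": tn / (tn + fp) if tn + fp else float("nan"),
        }
    return accuracy, per_class

# Toy labels (hypothetical): one II case called IIF and one IIF case called II.
y_true = ["I", "I", "II", "II", "IIF", "IIF", "III", "III", "IV", "IV"]
y_pred = ["I", "I", "II", "IIF", "II", "IIF", "III", "III", "IV", "IV"]

accuracy, per_class = one_vs_rest_metrics(y_true, y_pred)
print(f"accuracy = {accuracy:.2f}")
for label in BOSNIAK:
    m = per_class[label]
    print(f"{label:>3}: sensitivity = {m['sensitivity']:.3f}, specificity = {m['specificity']:.3f}")
```

In this toy run, confusing II with IIF lowers sensitivity for both categories while leaving the other three untouched, which is the kind of pattern per-class metrics are meant to expose.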

RESULTS

GPT‑4 achieved the highest accuracy (87% zero-shot, 99% few-shot), followed by Copilot (81-86%), Gemini (55-69%), and Perplexity (43-69%). NotebookLM, tested only under RAG conditions, reached 87% accuracy. Few-shot learning significantly improved performance (p < 0.05). Classification of Bosniak IIF lesions remained challenging across models.
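Because zero-shot and few-shot prompting are applied to the same cases, the comparison is paired, which is the setting McNemar's test addresses. The abstract reports neither the discordant-pair counts nor whether an exact or chi-squared variant was used, so the following is only a sketch of the exact (binomial) form. The function name `mcnemar_exact` and the ten correctness flags are hypothetical, not study results.

```python
from math import comb

def mcnemar_exact(correct_a, correct_b):
    """Two-sided exact McNemar test on paired correctness flags.

    Returns (b, c, p_value), where b counts cases only condition A got
    right and c counts cases only condition B got right.
    """
    if len(correct_a) != len(correct_b):
        raise ValueError("both conditions must cover the same cases")
    b = sum(a and not bb for a, bb in zip(correct_a, correct_b))
    c = sum(bb and not a for a, bb in zip(correct_a, correct_b))
    n = b + c
    if n == 0:
        return b, c, 1.0  # no discordant pairs, no evidence of a difference
    k = min(b, c)
    # Under H0 the discordant pairs split 50/50, so the smaller count
    # follows Binomial(n, 0.5); double the lower tail for a two-sided test.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)

# Hypothetical flags for 10 cases (True = classified correctly).
zero_shot = [True] * 6 + [False] * 4
few_shot = [True] * 5 + [False] + [True] * 4

b, c, p_value = mcnemar_exact(zero_shot, few_shot)
print(f"only zero-shot correct: {b}, only few-shot correct: {c}, p = {p_value:.3f}")
```

Only the discordant pairs carry information here: cases that both conditions classify correctly, or both misclassify, do not affect the p-value.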

CONCLUSION

When provided with well-structured textual descriptions, LLMs can accurately classify renal cysts. Few-shot prompting significantly enhances performance. However, persistent difficulties in classifying borderline lesions such as Bosniak IIF highlight the need for further refinement and real-world validation.
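To make the distinction between the two prompting strategies concrete, the sketch below builds a zero-shot prompt (instruction plus case) and a few-shot prompt (the same instruction preceded by labeled examples). The study's actual prompts are not given in the abstract; the instruction wording, the helper `build_prompt`, and the example descriptions are all assumptions made for illustration.

```python
def build_prompt(description, examples=()):
    """Build a classification prompt.

    With no examples this is a zero-shot prompt; with one or more
    (description, category) pairs it becomes a few-shot prompt.
    """
    lines = [
        "Assign a Bosniak category (I, II, IIF, III, or IV) to the renal cyst described below.",
        "Answer with the category only.",
        "",
    ]
    for example_text, example_label in examples:
        lines += [f"Description: {example_text}", f"Category: {example_label}", ""]
    lines += [f"Description: {description}", "Category:"]
    return "\n".join(lines)

# Hypothetical worked examples, not taken from the study's dataset.
EXAMPLES = [
    ("Thin, smooth wall; no septa, calcification, or enhancement; water attenuation.", "I"),
    ("Cystic mass with an enhancing nodular soft-tissue component.", "IV"),
]

case = "Cystic lesion with a few thin septa showing perceived but not measurable enhancement."

zero_shot_prompt = build_prompt(case)
few_shot_prompt = build_prompt(case, EXAMPLES)
print(few_shot_prompt)
```

Both prompts end with the same open `Category:` line, so any difference in model output can be attributed to the added examples rather than to a change in the question.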
