Hao Yuexing, Qiu Zhiwen, Holmes Jason, Löckenhoff Corinna E, Liu Wei, Ghassemi Marzyeh, Kalantari Saleh
Department of Radiation Oncology, Mayo Clinic, Phoenix, AZ, USA.
Cornell University, Ithaca, NY, USA.
NPJ Digit Med. 2025 Jul 17;8(1):450. doi: 10.1038/s41746-025-01824-7.
Large Language Models (LLMs) are increasingly used to support cancer patients and clinicians in decision-making. This systematic review investigates how LLMs are integrated into oncology and how researchers evaluate them. We conducted a comprehensive search across PubMed, Web of Science, Scopus, and the ACM Digital Library through May 2024, identifying 56 studies covering 15 cancer types. The meta-analysis suggested that LLMs were commonly used to summarize, translate, and communicate clinical information, but performance varied: average overall accuracy was 76.2%, while average diagnostic accuracy was lower at 67.4%, revealing gaps in the clinical readiness of this technology. Most evaluations relied heavily on quantitative datasets and automated methods without human graders, emphasizing "accuracy" and "appropriateness" while rarely addressing "safety", "harm", or "clarity". Current limitations of LLMs in cancer decision-making, such as limited domain knowledge and dependence on human oversight, underscore the need for open datasets and standardized evaluations to improve reliability.