Comparative Analysis of Generative Pre-Trained Transformer Models in Oncogene-Driven Non-Small Cell Lung Cancer: Introducing the Generative Artificial Intelligence Performance Score.

Author information

Hamilton Zacharie, Aseem Aseem, Chen Zhengjia, Naffakh Noor, Reizine Natalie M, Weinberg Frank, Jain Shikha, Kessler Larry G, Gadi Vijayakrishna K, Bun Christopher, Nguyen Ryan H

Affiliations

University of Illinois Chicago, Chicago, IL.

University of Washington, Seattle, WA.

Publication information

JCO Clin Cancer Inform. 2024 Dec;8:e2400123. doi: 10.1200/CCI.24.00123. Epub 2024 Dec 11.

DOI:10.1200/CCI.24.00123

PMID:39661913

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11634130/

Abstract

PURPOSE

Precision oncology in non-small cell lung cancer (NSCLC) relies on biomarker testing for clinical decision making. Despite its importance, challenges like the lack of genomic oncology training, nonstandardized biomarker reporting, and a rapidly evolving treatment landscape hinder its practice. Generative artificial intelligence (AI), such as ChatGPT, offers promise for enhancing clinical decision support. Effective performance metrics are crucial to evaluate these models' accuracy and their propensity for producing incorrect or hallucinated information. We assessed various ChatGPT versions' ability to generate accurate next-generation sequencing reports and treatment recommendations for NSCLC, using a novel Generative AI Performance Score (G-PS), which considers accuracy, relevancy, and hallucinations.

METHODS

We queried ChatGPT versions for first-line NSCLC treatment recommendations with a Food and Drug Administration-approved targeted therapy, using a zero-shot prompt approach for eight oncogenes. Responses were assessed against National Comprehensive Cancer Network (NCCN) guidelines for accuracy, relevance, and hallucinations, with G-PS calculating scores from -1 (all hallucinations) to 1 (fully NCCN-compliant recommendations). G-PS was designed as a composite measure with a base score for correct recommendations (weighted for preferred treatments) and a penalty for hallucinations.
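
The abstract describes G-PS only at a high level: a base score weighted toward preferred treatments, a penalty for hallucinations, and a range from -1 to 1. The exact formula is not given here. As a rough illustration of how a composite of this kind could behave, the sketch below uses a hypothetical rule in which each recommendation in a response is labeled and the score is the weighted share of correct recommendations minus the share of hallucinated ones. The function name `toy_gps`, the label names, and the 1.0/0.5 weights are assumptions made for illustration; they are not the published G-PS definition.

```python
# Illustrative only: a toy composite score shaped like the G-PS description
# (base score minus hallucination penalty, bounded in [-1, 1]).
# The weights and labels below are hypothetical, NOT taken from the paper.

HYPOTHETICAL_WEIGHTS = {
    "preferred": 1.0,          # guideline-preferred therapy (assumed full credit)
    "other_recommended": 0.5,  # guideline-listed but not preferred (assumed half credit)
    "hallucinated": 0.0,       # not supported by the guideline (no credit)
}


def toy_gps(labels):
    """Return a toy score in [-1, 1] for one response.

    labels: one label per recommended treatment in the response,
            each a key of HYPOTHETICAL_WEIGHTS.
    """
    if not labels:
        raise ValueError("a response must contain at least one recommendation")
    unknown = set(labels) - set(HYPOTHETICAL_WEIGHTS)
    if unknown:
        raise ValueError(f"unknown labels: {sorted(unknown)}")

    n = len(labels)
    # Base score: weighted share of guideline-concordant recommendations.
    base = sum(HYPOTHETICAL_WEIGHTS[label] for label in labels) / n
    # Penalty: share of recommendations that are hallucinated.
    penalty = sum(1 for label in labels if label == "hallucinated") / n
    return base - penalty


if __name__ == "__main__":
    print(toy_gps(["preferred", "preferred"]))        # all preferred
    print(toy_gps(["hallucinated", "hallucinated"]))  # all hallucinated
    print(toy_gps(["preferred", "hallucinated"]))     # mixed response
```

Under this toy rule, a response made up entirely of preferred therapies scores 1 and one made up entirely of hallucinations scores -1, matching the endpoints stated above. Intermediate values depend entirely on the assumed weights.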

RESULTS

Analyzing 160 responses, generative pre-trained transformer (GPT)-4 outperformed GPT-3.5, showing a higher base score (90% v 60%; P < .01) and fewer hallucinations (34% v 53%; P < .01). GPT-4's overall G-PS was significantly higher (0.34 v -0.15; P < .01), indicating superior performance.

CONCLUSION

This study highlights the rapid improvement of generative AI in matching treatment recommendations with biomarkers in precision oncology. Although the rate of hallucinations improved in the GPT-4 model, future generative AI use in clinical care requires high levels of accuracy with minimal to no room for hallucinations. The G-PS represents a novel metric quantifying generative AI utility in health care compared with national guidelines, with potential adaptation beyond precision oncology.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/207f/11634130/520050014c1f/cci-8-e2400123-g001.jpg
