Benchmarking large language models GPT-4o, Llama 3.1, and Qwen 2.5 for cancer genetic variant classification.

Authors

Lin Kuan-Hsun, Kao Tzu-Hang, Wang Lei-Chi, Kuo Chen-Tsung, Chen Paul Chih-Hsueh, Chu Yuan-Chia, Yeh Yi-Chen

Affiliations

Department of Information Management, Taipei Veterans General Hospital, Taipei, Taiwan, ROC.

Department of Information Management, National Taipei University of Nursing and Health Sciences, Taipei, Taiwan, ROC.

Publication information

NPJ Precis Oncol. 2025 May 15;9(1):141. doi: 10.1038/s41698-025-00935-4.

DOI:10.1038/s41698-025-00935-4

PMID:40369023

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12078457/

Abstract

Classifying cancer genetic variants based on clinical actionability is crucial yet challenging in precision oncology. Large language models (LLMs) offer potential solutions, but their performance remains underexplored. This study evaluates GPT-4o, Llama 3.1, and Qwen 2.5 in classifying genetic variants from the OncoKB and CIViC databases, as well as a real-world dataset derived from FoundationOne CDx reports. GPT-4o achieved the highest accuracy (0.7318) in distinguishing clinically relevant variants from variants of unknown clinical significance (VUS), outperforming Qwen 2.5 (0.5731) and Llama 3.1 (0.4976). LLMs demonstrated better concordance with expert annotations for variants with strong clinical evidence but exhibited greater inconsistencies for those with weaker evidence. All three models showed a tendency to assign variants to higher evidence levels, suggesting a propensity for overclassification. Prompt engineering significantly improved accuracy, while retrieval-augmented generation (RAG) further enhanced performance. Stability analysis across 100 iterations revealed greater consistency with the CIViC system than with OncoKB. These findings highlight the promise of LLMs in cancer genetic variant classification while underscoring the need for further optimization to improve accuracy, consistency, and clinical applicability.
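
The abstract reports two kinds of numbers: accuracy against expert annotations, and stability of a model's output over repeated runs. The abstract does not give the formulas used, so the following is only a minimal sketch of how such metrics could be computed. The function names, the label strings, and the toy data are all hypothetical, and modal agreement is just one plausible way to express run-to-run consistency, not necessarily the study's measure.

```python
from collections import Counter


def accuracy(predicted, expected):
    """Fraction of variants whose predicted label matches the expert label."""
    if not expected or len(predicted) != len(expected):
        raise ValueError("inputs must be non-empty and of equal length")
    matches = sum(p == e for p, e in zip(predicted, expected))
    return matches / len(expected)


def modal_agreement(labels):
    """Fraction of repeated runs that returned the most common label.

    Assumption: this stands in for a stability measure; the paper may
    define consistency differently.
    """
    if not labels:
        raise ValueError("labels must be non-empty")
    _, count = Counter(labels).most_common(1)[0]
    return count / len(labels)


# Hypothetical toy data: expert labels vs. one model's labels for four variants.
expert = ["relevant", "VUS", "relevant", "VUS"]
model = ["relevant", "relevant", "relevant", "VUS"]
print(accuracy(model, expert))  # 0.75

# Hypothetical repeated classifications of a single variant over 100 runs.
runs = ["Level 1"] * 90 + ["Level 2"] * 10
print(modal_agreement(runs))  # 0.9
```

In this toy example the single mismatch is a VUS labeled as relevant, which is the direction of error (overclassification) that the abstract describes for all three models.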
