Pre-operative T-stage discrimination in gallbladder cancer using machine learning and DeepSeek-R1.
Author information
Chae Joongwon, Wang Zhenyu, Wu Duanpo, Zhang Lian, Tuzikov Alexander, Madiyevich Magrupov Talat, Xu Min, Yu Dongmei, Qin Peiwu
Affiliations
Institute of Biopharmaceutical and Health Engineering, Shenzhen International Graduate School, Tsinghua University, Shenzhen, Guangdong, China.
School of Communication Engineering and the Artificial Intelligence Institute, Hangzhou Dianzi University, Hangzhou, Zhejiang, China.
Publication information
Front Oncol. 2025 Aug 1;15:1613462. doi: 10.3389/fonc.2025.1613462. eCollection 2025.
BACKGROUND
Gallbladder cancer (GBC) frequently exhibits non-specific early symptoms, delaying diagnosis. This study (i) assessed whether routine blood biomarkers can distinguish early T stages via machine learning and (ii) compared the T-stage discrimination performance of a large language model (DeepSeek-R1) when supplied with (a) radiology-report text alone versus (b) radiology-report text plus blood-biomarker values.
METHODS
We retrospectively analyzed 232 patients with pathologically confirmed GBC treated at Lishui Central Hospital between 2023 and 2024 (T1, n = 51; T2, n = 181). Seven blood variables were used to train random forest, support vector machine (SVC), XGBoost, and LightGBM models: neutrophil-to-lymphocyte ratio (NLR), monocyte-to-lymphocyte ratio (MLR), platelet-to-lymphocyte ratio (PLR), carcinoembryonic antigen (CEA), carbohydrate antigen 19-9 (CA19-9), carbohydrate antigen 125 (CA125), and alpha-fetoprotein (AFP). The Synthetic Minority Over-sampling Technique (SMOTE) was applied only to the training folds in one setting and omitted in the other. Model performance was evaluated on an independent test set (n = 47) by the area under the receiver-operating-characteristic curve (AUROC), with 95% confidence intervals (CI) from 1,000 bootstrap resamples; cross-validation (CV) accuracy served as a supplementary metric. DeepSeek-R1 was prompted in a zero-shot, chain-of-thought manner to classify T1 versus T2 using (a) the radiology report alone or (b) the report plus the patient's biomarker profile.
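The restriction of SMOTE to the training folds can be illustrated with a minimal, standard-library-only sketch. It is not the authors' pipeline: the function names `smote` and `oversampled_folds`, the choice of k = 5 neighbours, the 5-fold split, and the coding of the minority class (T1) as label 1 are all assumptions made for illustration. The point it shows is that synthetic samples are generated from, and added to, the training portion only, so every validation sample is an original observation.

```python
import random

def smote(minority, n_new, k=5, rng=None):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours (the core SMOTE idea).
    Requires at least two minority samples."""
    rng = rng or random.Random(0)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority samples by squared Euclidean distance to `base`.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        t = rng.random()  # interpolation weight in [0, 1)
        synthetic.append([a + t * (b - a) for a, b in zip(base, neighbour)])
    return synthetic

def oversampled_folds(X, y, n_folds=5, seed=0):
    """Yield (X_train, y_train, X_val, y_val) for each fold.
    Only the training part is oversampled; validation data stay untouched.
    Assumption: label 1 marks the minority class."""
    rng = random.Random(seed)
    idx = list(range(len(X)))
    rng.shuffle(idx)
    folds = [idx[i::n_folds] for i in range(n_folds)]
    for fold in folds:
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        X_tr = [X[i] for i in train]
        y_tr = [y[i] for i in train]
        minority = [x for x, label in zip(X_tr, y_tr) if label == 1]
        # Number of synthetic samples needed to match the majority count.
        n_new = (len(y_tr) - len(minority)) - len(minority)
        if n_new > 0 and len(minority) > 1:
            new_points = smote(minority, n_new, rng=rng)
            X_tr = X_tr + new_points
            y_tr = y_tr + [1] * len(new_points)
        yield X_tr, y_tr, [X[i] for i in fold], [y[i] for i in fold]
```

Oversampling before splitting would let synthetic points derived from validation samples leak into training, which can inflate cross-validation accuracy without improving performance on held-out data.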
RESULTS
Biomarker-based machine-learning models yielded uniformly poor T-stage discrimination. Without SMOTE, individual models such as XGBoost achieved an AUROC of 0.508 on the independent test set, while recall for the T1 class remained low (e.g., 14.3% for some models), indicating performance near random chance. Applying SMOTE to the training data produced statistically significant gains in cross-validation (CV) accuracy for several models (e.g., XGBoost CV Acc. 0.71 → 0.80, p = 0.005; LGBM CV Acc. [] → [], p = 0.004). However, these improvements did not translate into better discrimination on the independent test set; for instance, XGBoost's AUROC decreased from 0.508 to 0.473 after SMOTE was applied. Overall, the biomarker models failed to provide clinically meaningful T-stage differentiation. DeepSeek-R1 analyzing radiology text alone reached 89.6% accuracy on the full 232-patient cohort and consistently flagged T2 cases on phrases such as "gallbladder wall thickening." Supplying biomarker values did not change accuracy (89.6%).
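For readers less familiar with the metrics reported here, the following standard-library sketch shows one common way to compute AUROC, a percentile bootstrap confidence interval, and per-class recall. It is an illustration under stated assumptions, not the study's code: the function names, the percentile method for the interval, the fixed seed, and the coding of T1 as label 1 are all hypothetical choices.

```python
import random

def auroc(y_true, scores):
    """AUROC as the probability that a randomly chosen positive case
    scores higher than a randomly chosen negative case (ties count 0.5)."""
    pos = [s for s, t in zip(scores, y_true) if t == 1]
    neg = [s for s, t in zip(scores, y_true) if t == 0]
    if not pos or not neg:
        raise ValueError("both classes are required to compute AUROC")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def bootstrap_ci(y_true, scores, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for AUROC from n_boot resamples of the test set."""
    auroc(y_true, scores)  # raises early if one class is missing
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [y_true[i] for i in idx]
        if len(set(ys)) < 2:
            continue  # this resample lacks one class, so AUROC is undefined
        stats.append(auroc(ys, [scores[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper

def recall(y_true, y_pred, positive=1):
    """Fraction of true `positive` cases that the model labelled as `positive`."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    return tp / (tp + fn)
```

An AUROC close to 0.5 means the model ranks a random T1 case above a random T2 case about as often as a coin flip would, which is why values such as 0.508 and 0.473 are read as near-chance discrimination.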
CONCLUSIONS
The evaluated blood biomarkers did not independently aid early T-stage discrimination, and SMOTE offered no meaningful performance gain. Conversely, a radiology-text-driven large language model delivered high accuracy with an interpretable rationale, highlighting its potential to guide surgical strategy in GBC. Prospective multi-center studies with larger cohorts are warranted to confirm these findings.