Su Jie, Lu Hui, Zhang Ruihuan, Cui Na, Chen Chao, Si Qin, Song Biao
Medical neurobiology laboratory, Inner Mongolia Medical University, Huhhot, 010030, China.
College of Computer Science, Inner Mongolia University, Hohhot, 010021, China.
Sci Rep. 2025 Jul 2;15(1):22655. doi: 10.1038/s41598-025-08166-0.
Cervical cancer (CC) is the fourth most common cancer among women globally. The key to preventing and treating CC is early detection, diagnosis, and treatment. This study aimed to develop an interpretable model for predicting CC risk using routine blood data. The primary endpoint variable was the occurrence of CC, as confirmed by histopathological diagnosis. We used the SHapley Additive exPlanations (SHAP) method to provide interpretability and identify key factors associated with CC. In this retrospective study, medical records of patients from 2013 to 2023 were collected. A total of 2,503 patients diagnosed with CC were included in the case group, while the control group comprised 3,794 patients without apparent signs of the disease, including women with other gynecological conditions and healthy individuals undergoing routine check-ups. Age, clinical diagnosis information, and 22 blood cell analysis results were considered. Four algorithms were applied to construct models for estimating the likelihood of CC occurrence. Using the least absolute shrinkage and selection operator (LASSO) and the random forest (RF) method, 15 key features were ultimately selected from an initial set of 23 features for model training. These features include age, red blood cell count (RBC), platelet distribution width (PDW), white blood cell count (WBC), lymphocyte percentage (LYMPH%), basophil count (BASO), basophil percentage (BASO%), lymphocyte absolute value (LYMPH), neutrophil percentage (NEUT%), hemoglobin (HGB), mean corpuscular hemoglobin concentration (MCHC), red cell distribution width (R-CV), mean platelet volume (MPV), plateletcrit (PCT), and [...]. Among the four models, the extreme gradient boosting (XGBoost) model achieved the highest predictive performance, with an area under the curve (AUC) of 0.964. In contrast, the RF model exhibited the poorest generalization ability, with an AUC of 0.907. The SHAP method identified the top 6 predictors of CC by importance ranking, and the average platelet distribution width (PDW) was identified as the most important predictor variable for CC occurrence (the primary endpoint variable).
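To illustrate the idea behind the SHAP attributions mentioned in the abstract, the following minimal sketch computes exact Shapley values for a single prediction by enumerating feature coalitions. It is not the study's pipeline: the scoring function `toy_risk_score`, its coefficients, and the patient and baseline values are all hypothetical and chosen only for illustration, and the SHAP library itself uses faster, model-specific algorithms rather than brute-force enumeration.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley attribution of predict(x) relative to predict(baseline).

    Features outside a coalition are held at their baseline value.
    The cost grows exponentially with the number of features.
    """
    names = list(x)
    n = len(names)
    phi = {}
    for name in names:
        others = [m for m in names if m != name]
        total = 0.0
        for k in range(n):
            # Shapley weight for coalitions of size k that exclude `name`.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                without = {m: (x[m] if m in subset else baseline[m]) for m in names}
                with_feature = dict(without)
                with_feature[name] = x[name]
                total += weight * (predict(with_feature) - predict(without))
        phi[name] = total
    return phi


def toy_risk_score(f):
    # Hypothetical score with made-up coefficients, NOT the published model.
    # The PDW * LYMPH% term adds an interaction so attributions are non-trivial.
    return 0.01 * f["age"] + 0.05 * f["PDW"] - 0.001 * f["PDW"] * f["LYMPH%"]


# Hypothetical feature values for one patient and for a reference (baseline) profile.
patient = {"age": 52, "PDW": 16.5, "LYMPH%": 22.0}
baseline = {"age": 45, "PDW": 12.0, "LYMPH%": 30.0}

phi = shapley_values(toy_risk_score, patient, baseline)
for feature, value in sorted(phi.items(), key=lambda kv: -abs(kv[1])):
    print(f"{feature}: {value:+.4f}")
```

The attributions sum to the difference between the patient's score and the baseline score, which is the additivity property that gives the method its name. Ranking features by the mean absolute attribution across many patients is one common way to obtain an importance ordering like the one reported above.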