Sun Lin, Yang Lingping, Liu Xiyao, Tang Lan, Zeng Qi, Gao Yuwen, Chen Qian, Liu Zhaohai, Peng Bin
School of Public Health and Management, Chongqing Medical University, Chongqing, China.
Department of Obstetrics, The First Affiliated Hospital of Chongqing Medical University, Chongqing, China.
Front Oncol. 2022 Feb 15;12:821453. doi: 10.3389/fonc.2022.821453. eCollection 2022.
Accurately identifying women at high risk of developing cervical cancer would help optimize cervical screening strategies and make better use of medical resources. However, the predictive models currently in use require clinical physiological and biochemical indicators, which narrows their scope of application. Stacking-integrated machine learning (SIML) is an advanced machine learning technique that combines multiple learning algorithms to improve predictive performance. This study aimed to develop a stacking-integrated model that can identify women at high risk of developing cervical cancer based on their demographic, behavioral, and historical clinical factors.
Data from 858 women screened for cervical cancer at a Venezuelan hospital were used to develop the SIML algorithm. The screening data were randomly split into training data (80%), used to develop the algorithm, and testing data (20%), used to validate its accuracy. A random forest (RF) model and univariate logistic regression were used to identify features predictive of developing cervical cancer. Twelve well-known machine learning (ML) algorithms were selected, and their performance in predicting cervical cancer was compared. A correlation coefficient matrix was used to cluster the models based on their performance. The SIML was then developed using the best-performing techniques. The sensitivity, specificity, and area under the curve (AUC) of all models were calculated.
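The workflow described above (an 80/20 split, RF-based feature ranking, then a stacked ensemble) can be sketched with scikit-learn. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the data are synthetic, the stratified split, the two base learners, and the logistic regression meta-learner are all assumptions, and the abstract does not say which of the twelve algorithms were actually stacked.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the screening data (assumption): 858 rows,
# 30 candidate features, roughly 10% positive cases.
X, y = make_classification(
    n_samples=858, n_features=30, n_informative=8,
    weights=[0.9, 0.1], random_state=0,
)

# 80% training / 20% testing. Stratifying on the outcome is an assumption;
# the abstract only says the split was random.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Rank features by random forest importance and keep the top 18,
# mirroring the number of features reported in the results.
rf = RandomForestClassifier(n_estimators=200, random_state=0)
rf.fit(X_train, y_train)
top = rf.feature_importances_.argsort()[::-1][:18]

# Stacking: base learners produce out-of-fold predictions (cv=5) that
# a meta-learner combines. The choice of learners here is hypothetical.
stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=200, random_state=0)),
        ("knn", KNeighborsClassifier(n_neighbors=5)),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,
)
stack.fit(X_train[:, top], y_train)

# Evaluate on the held-out 20%.
test_scores = stack.predict_proba(X_test[:, top])[:, 1]
auc = roc_auc_score(y_test, test_scores)
print(f"held-out AUC: {auc:.3f}")
```

The feature ranking is fit on the training portion only, so the held-out 20% does not influence which features are selected.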
The RF model identified 18 features predictive of developing cervical cancer. The use of hormonal contraceptives was identified as the most important risk factor, followed by the number of pregnancies, years of smoking, and the number of sexual partners. The SIML algorithm had the best overall performance of the methods compared, reaching an AUC of 0.877, a sensitivity of 81.8%, and a specificity of 81.9%.
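For readers less familiar with the three reported metrics, the following minimal sketch shows how sensitivity, specificity, and AUC are computed from labels and predicted scores. The function names, the toy labels and scores, and the 0.5 decision threshold are all hypothetical and chosen only for illustration; they are not data from the study.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels coded 0/1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)


def auc_score(y_true, scores):
    """AUC as the probability that a random positive case scores higher
    than a random negative case (ties count as one half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos for n in neg
    )
    return wins / (len(pos) * len(neg))


# Toy example (hypothetical values): 3 positive and 5 negative cases.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.3, 0.7, 0.4, 0.2, 0.1, 0.05]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold

sens, spec = sensitivity_specificity(y_true, y_pred)
auc = auc_score(y_true, scores)
print(f"sensitivity={sens:.3f} specificity={spec:.3f} AUC={auc:.3f}")
```

Sensitivity and specificity depend on the chosen threshold, whereas AUC summarizes ranking quality across all thresholds, which is why the three are typically reported together.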
This study shows that SIML can be used to accurately identify women at high risk of developing cervical cancer. The model could be used to personalize screening programs by optimizing the screening interval and care plan for high- and low-risk patients based on their demographics, behavioral patterns, and clinical data.