Division of Epidemiology and Biostatistics, School of Public Health, Faculty of Health Sciences, University of the Witwatersrand, Parktown, Johannesburg, South Africa.
Medical Research Council/Wits University Rural Public Health and Health Transitions Research Unit (Agincourt), School of Public Health, Faculty of Health Sciences, University of the Witwatersrand, Johannesburg, South Africa.
Front Public Health. 2021 Jul 7;9:694306. doi: 10.3389/fpubh.2021.694306. eCollection 2021.
South Africa (SA) has the highest incidence of colorectal cancer (CRC) in Sub-Saharan Africa (SSA). However, there is limited research on CRC recurrence and survival in SA, and reported CRC recurrence and overall survival vary widely across studies. Accurate prediction of patients at risk can improve clinical expectations and decisions for the South African CRC patient population. We explored the feasibility of integrating statistical and machine learning (ML) algorithms to achieve higher predictive performance and interpretability of findings. We selected and compared six algorithms: logistic regression (LR), naïve Bayes (NB), C5.0, random forest (RF), support vector machine (SVM), and artificial neural network (ANN). Features commonly selected by both OneR and information gain, within 10-fold cross-validation, were used for model development. The validity and stability of the predictive models were further assessed using simulated datasets. The six algorithms achieved high discriminative accuracy (AUC-ROC). ANN achieved the highest AUC-ROC for recurrence (87.0%) and survival (82.0%), and the other models performed comparably to ANN. We observed no statistically significant difference in the performance of the models. Features including radiological stage and patient's age, histology, and race are risk factors for CRC recurrence and patient survival, respectively. Consistent with other studies and what is known in the field, we have affirmed important predictive factors for recurrence and survival using rigorous procedures. Outcomes of this study can be generalised to CRC patient populations elsewhere in SA and in other SSA countries with similar patient profiles.
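The feature-selection step described above (keeping features chosen by both OneR and information gain) can be illustrated with a minimal sketch. It is not the authors' code: the toy records, the feature names `stage` and `age_group`, and the helper functions are all hypothetical, OneR is reduced to its simplest form (one majority-class rule per feature value), and the 10-fold cross-validation loop is omitted for brevity.

```python
from collections import Counter, defaultdict
from math import log2

# Hypothetical toy records (NOT data from the study): two categorical
# features and a binary outcome (1 = recurrence, 0 = no recurrence).
ROWS = [
    {"stage": "late",  "age_group": "old",   "recurrence": 1},
    {"stage": "late",  "age_group": "young", "recurrence": 1},
    {"stage": "late",  "age_group": "old",   "recurrence": 1},
    {"stage": "early", "age_group": "young", "recurrence": 0},
    {"stage": "early", "age_group": "old",   "recurrence": 0},
    {"stage": "early", "age_group": "young", "recurrence": 0},
    {"stage": "late",  "age_group": "old",   "recurrence": 0},
    {"stage": "early", "age_group": "young", "recurrence": 1},
]
FEATURES = ["stage", "age_group"]
TARGET = "recurrence"


def group_labels(rows, feature, target):
    """Collect the outcome labels observed for each value of a feature."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[feature]].append(row[target])
    return groups


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def information_gain(rows, feature, target):
    """Entropy of the outcome minus its expected entropy given the feature."""
    base = entropy([row[target] for row in rows])
    groups = group_labels(rows, feature, target)
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return base - remainder


def one_r_accuracy(rows, feature, target):
    """Accuracy of a one-feature rule predicting the majority class per value."""
    groups = group_labels(rows, feature, target)
    correct = sum(Counter(g).most_common(1)[0][1] for g in groups.values())
    return correct / len(rows)


def rank_features(rows, features, target, score):
    """Sort features from most to least useful under the given score."""
    return sorted(features, key=lambda f: score(rows, f, target), reverse=True)


# Keep only the features that both criteria place in their top k (k = 1 here).
top_ig = rank_features(ROWS, FEATURES, TARGET, information_gain)[:1]
top_oner = rank_features(ROWS, FEATURES, TARGET, one_r_accuracy)[:1]
common = [f for f in top_ig if f in top_oner]
print(common)
```

On this toy data both criteria rank `stage` first, so it is the only commonly selected feature. In a cross-validated setting the same ranking would be repeated on each training fold, so that feature selection never sees the held-out fold.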