Mülder Duco T, van den Puttelaar Rosita, Meester Reinier G S, O'Mahony James F, Lansdorp-Vogelaar Iris
Department of Public Health, Erasmus Medical Center, Rotterdam, Netherlands.
Int J Med Inform. 2023 Oct;178:105194. doi: 10.1016/j.ijmedinf.2023.105194. Epub 2023 Aug 16.
Identifying individuals at elevated risk can improve cancer screening programmes by permitting risk-adjusted screening intensities. Previous work introduced a prognostic model that uses sex, age and the two preceding faecal haemoglobin concentrations to predict the risk of colorectal cancer (CRC) in the next screening round. Using data from three screening rounds, this model attained an area under the receiver-operating-characteristic curve (AUC) of 0.78 for predicting advanced neoplasia (AN). We validated this existing logistic regression (LR) model and attempted to improve it by applying a more flexible machine-learning approach.
We trained an existing LR and a newly developed random forest (RF) model using updated data from 219,257 third-round participants of the Dutch CRC screening programme until 2018. For both models, we performed two separate out-of-sample validations using 1,137,599 third-round participants after 2018 and 192,793 fourth-round participants from 2020 onwards. We evaluated the AUC and relative risks of the predicted high-risk groups for the outcomes AN and CRC.
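As an illustration of the evaluation metric only, the AUC can be computed from predicted risks and observed outcomes with the rank-based (Mann-Whitney) formula. The sketch below is not the study's code; the function name `auc` and the toy inputs are assumptions made for this example.

```python
def auc(scores, labels):
    """Rank-based AUC: the probability that a randomly chosen positive
    case has a higher score than a randomly chosen negative case
    (tied scores count as one half)."""
    pairs = sorted(zip(scores, labels))
    n = len(pairs)
    ranks = [0.0] * n
    i = 0
    while i < n:
        # Find the block of tied scores starting at position i.
        j = i
        while j + 1 < n and pairs[j + 1][0] == pairs[i][0]:
            j += 1
        # Every member of a tied block gets the block's average 1-based rank.
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[k] = avg_rank
        i = j + 1

    n_pos = sum(1 for _, label in pairs if label == 1)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC needs at least one positive and one negative case")

    rank_sum_pos = sum(r for r, (_, label) in zip(ranks, pairs) if label == 1)
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


# Hypothetical predicted risks and outcomes (1 = AN detected, 0 = not).
example_auc = auc([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1])
```

The same function applies unchanged to predictions from either an LR or an RF model, since it only needs a risk score per participant.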
For third-round participants after 2018, the AUC for predicting AN was 0.77 (95% CI: 0.76-0.77) using LR and 0.77 (95% CI: 0.77-0.77) using RF. For fourth-round participants, the AUCs were 0.73 (95% CI: 0.72-0.74) and 0.73 (95% CI: 0.72-0.74) for the LR and RF models, respectively. For both models, the 5% of participants with the highest predicted risk had a 7-fold risk of AN compared with the population average, whereas the 80% with the lowest predicted risk had a risk below the population average for third-round participants.
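The relative risk of a predicted high-risk group can be read as the outcome rate in that group divided by the outcome rate in the whole population. The following is a minimal sketch under that assumption; the function name `relative_risk_of_top_fraction` and the toy data are hypothetical and do not come from the study.

```python
def relative_risk_of_top_fraction(scores, outcomes, fraction=0.05):
    """Outcome rate among the `fraction` of participants with the highest
    predicted risk, divided by the outcome rate in the whole population."""
    n = len(scores)
    if n == 0 or n != len(outcomes):
        raise ValueError("scores and outcomes must be non-empty and equal in length")

    # Size of the high-risk group (at least one participant).
    k = max(1, round(n * fraction))

    # Indices of participants ordered from highest to lowest predicted risk.
    order = sorted(range(n), key=lambda i: scores[i], reverse=True)
    top = order[:k]

    top_rate = sum(outcomes[i] for i in top) / k
    overall_rate = sum(outcomes) / n
    if overall_rate == 0:
        raise ValueError("no outcomes observed; relative risk is undefined")
    return top_rate / overall_rate


# Toy population of 20: the highest-scored participant and one
# low-scored participant have the outcome.
toy_scores = list(range(20))
toy_outcomes = [0] * 20
toy_outcomes[19] = 1
toy_outcomes[3] = 1
toy_rr = relative_risk_of_top_fraction(toy_scores, toy_outcomes, fraction=0.05)
```

In this toy example the top 5% is a single participant with the outcome, against a population rate of 2 in 20, so the relative risk is 10.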
The LR is a valid risk prediction method in stool-based screening programmes. Although predictive performance declined marginally, the LR model still effectively predicted risk in subsequent screening rounds. An RF did not improve CRC risk prediction compared to an LR, probably due to the limited number of available explanatory variables. The LR remains the preferred prediction tool because of its interpretability.