Nartowt Bradley J, Hart Gregory R, Muhammad Wazir, Liang Ying, Stark Gigi F, Deng Jun
Department of Therapeutic Radiology, Yale University, New Haven, CT, United States.
Department of Radiation Oncology, Medical College of Wisconsin, Milwaukee, WI, United States.
Front Big Data. 2020 Mar 10;3:6. doi: 10.3389/fdata.2020.00006. eCollection 2020.
While colorectal cancer (CRC) ranks third in both prevalence and mortality among cancers in the United States, there is no effective method to screen the general public for CRC risk. In this study, to identify an effective mass screening method for CRC risk, we evaluated seven supervised machine learning algorithms: linear discriminant analysis, support vector machine, naive Bayes, decision tree, random forest, logistic regression, and artificial neural network. Models were trained and cross-tested with the National Health Interview Survey (NHIS) and the Prostate, Lung, Colorectal, and Ovarian (PLCO) Cancer Screening Trial datasets. Six imputation methods were used to handle missing data: mean, Gaussian, Lorentzian, one-hot encoding, Gaussian expectation-maximization, and listwise deletion. Among all combinations of model configuration and imputation method, the artificial neural network with expectation-maximization imputation emerged as the best, with a concordance of 0.70 ± 0.02, sensitivity of 0.63 ± 0.06, and specificity of 0.82 ± 0.04. In stratifying CRC risk in the NHIS and PLCO datasets, only 2% of negative cases were misclassified as high risk and 6% of positive cases were misclassified as low risk. In modeling the CRC-free probability with Kaplan-Meier estimators, the low-, medium-, and high-CRC-risk groups showed statistically significant separation. Our results indicated that the trained artificial neural network can be used as an effective screening tool for early intervention and prevention of CRC in large populations.
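As an illustration of the three performance measures reported above, the following minimal sketch computes sensitivity, specificity, and concordance for a binary classifier. It assumes that concordance refers to the c-statistic (for a binary outcome, the area under the ROC curve); the function names, the 0.5 threshold, and the toy labels and scores are hypothetical and are not taken from the study.

```python
from itertools import product


def sensitivity_specificity(labels, scores, threshold):
    """Return (sensitivity, specificity), calling score >= threshold positive."""
    pairs = list(zip(labels, scores))
    tp = sum(1 for y, s in pairs if y == 1 and s >= threshold)
    fn = sum(1 for y, s in pairs if y == 1 and s < threshold)
    tn = sum(1 for y, s in pairs if y == 0 and s < threshold)
    fp = sum(1 for y, s in pairs if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


def concordance(labels, scores):
    """Fraction of (positive, negative) pairs in which the positive case
    receives the higher score; tied scores count as half a concordant pair."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    total = 0.0
    for p, n in product(pos, neg):
        if p > n:
            total += 1.0
        elif p == n:
            total += 0.5
    return total / (len(pos) * len(neg))


# Hypothetical toy data: 1 = CRC case, 0 = CRC-free; scores are model outputs.
labels = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.6, 0.3, 0.7, 0.4, 0.2, 0.1, 0.05]

sens, spec = sensitivity_specificity(labels, scores, threshold=0.5)
auc = concordance(labels, scores)
print(f"sensitivity={sens:.2f} specificity={spec:.2f} concordance={auc:.2f}")
```

Sensitivity and specificity depend on the chosen threshold, whereas concordance summarizes ranking quality across all thresholds, which is presumably why the abstract reports all three.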