Hart Gregory R, Yan Vanessa, Huang Gloria S, Liang Ying, Nartowt Bradley J, Muhammad Wazir, Deng Jun
Department of Therapeutic Radiology, Yale University, New Haven, CT, U.S.A.
Department of Statistics and Data Science, Yale University, New Haven, CT, U.S.A.
Front Artif Intell. 2020 Nov 24;3:539879. doi: 10.3389/frai.2020.539879. eCollection 2020.
Incidence and mortality rates of endometrial cancer are increasing, prompting greater interest in risk prediction and stratification to support screening and prevention. Previous risk models have had moderate success, with the area under the curve (AUC) ranging from 0.68 to 0.77. Here we demonstrate a population-based machine learning model for endometrial cancer screening that achieves a testing AUC of 0.96. We train seven machine learning algorithms solely on personal health data, without any genomic data, imaging, biomarkers, or invasive procedures. The data come from the Prostate, Lung, Colorectal, and Ovarian Cancer Screening Trial (PLCO). We further compare our models with 15 practicing gynecologic oncologists and primary care physicians in stratifying endometrial cancer risk for 100 women. The random forest model achieves a testing AUC of 0.96 and the neural network model achieves a testing AUC of 0.91. Compared with the physicians, the random forest model is 2.5 times better at identifying women at above-average risk, with a 2-fold reduction in the false positive rate; the neural network model is 2 times better at identifying these women, with a 3-fold reduction in the false positive rate. Our machine learning models provide a non-invasive and cost-effective way to identify high-risk sub-populations who may benefit from early screening for endometrial cancer, prior to disease onset. Through statistical biopsy of personal health data, we have identified a new and effective approach to early cancer detection and prevention for individual patients.
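For readers unfamiliar with the AUC figures quoted above, the following minimal sketch shows one common way to compute the area under the ROC curve: as the fraction of (case, non-case) pairs in which the case receives the higher predicted risk, with ties counted as half. It assumes the reported AUC is the usual ROC AUC. The function name `roc_auc` and the toy labels and scores are hypothetical illustrations; they are not taken from the PLCO data or from the authors' code.

```python
def roc_auc(labels, scores):
    """Pairwise (Mann-Whitney) estimate of the area under the ROC curve.

    labels: 1 for a case, 0 for a non-case.
    scores: predicted risk, higher meaning more likely to be a case.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one case and one non-case")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # case ranked above non-case
            elif p == n:
                wins += 0.5   # a tie counts as half
    return wins / (len(pos) * len(neg))


# Hypothetical toy data, for illustration only.
labels = [0, 0, 1, 0, 1, 1, 0, 1]
scores = [0.1, 0.3, 0.35, 0.4, 0.6, 0.7, 0.2, 0.9]
auc = roc_auc(labels, scores)
print(f"AUC = {auc:.4f}")
```

Under this reading, an AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation of cases from non-cases. The quadratic pairwise loop is chosen for clarity; library implementations typically use a sort-based method instead.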