Lee Eunsaem, Jung Se Young, Hwang Hyung Ju, Jung Jaewoo
Department of Mathematics, Pohang University of Science and Technology, Pohang-si, Republic of Korea.
Office of eHealth Research and Businesses, Seoul National University Bundang Hospital, Seongnam-si, Republic of Korea.
JMIR Med Inform. 2021 Aug 30;9(8):e29807. doi: 10.2196/29807.
Nationwide population-based cohorts provide a new opportunity to build automated risk prediction models at the patient level, and claims data are among the more useful resources for this purpose. Patient-level prediction models should be developed to avoid unnecessary diagnostic interventions after cancer screening tests.
We aimed to use machine learning algorithms and nationwide claims databases to develop cancer prediction models that are explainable and easily applicable in real-world environments.
As source data, we used the Korean National Insurance System Database. Every Korean aged ≥40 years undergoes a national health checkup every 2 years. We gathered all variables from the database, including demographic information, basic laboratory values, anthropometric values, and previous medical history. We applied conventional logistic regression methods, light gradient boosting methods, neural networks, survival analysis, and one-class embedding classifier methods, which build on deep learning-based anomaly detection, to analyze high-dimensional data effectively. Performance was measured with the area under the curve (AUC) and the area under the precision-recall curve (AUPRC). We validated our models externally with a health checkup database from a tertiary hospital.
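The two performance measures named above can be illustrated with a minimal sketch. It assumes that "area under the curve" refers to the area under the receiver operating characteristic curve, and it estimates the area under the precision-recall curve by average precision, which may differ from the estimator the authors used. The function names `roc_auc` and `average_precision` and the toy labels and scores are hypothetical and are not taken from the study.

```python
from typing import Sequence


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve via the rank-sum (Mann-Whitney U) identity.

    Tied scores receive the average of the ranks they span.
    """
    pairs = sorted(zip(scores, y_true))  # ascending by score
    n = len(pairs)
    n_pos = sum(y_true)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both classes must be present")

    rank_sum_pos = 0.0
    i = 0
    while i < n:
        j = i
        while j < n and pairs[j][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0  # mean of the 1-based ranks i+1..j
        rank_sum_pos += avg_rank * sum(label for _, label in pairs[i:j])
        i = j

    u_stat = rank_sum_pos - n_pos * (n_pos + 1) / 2.0
    return u_stat / (n_pos * n_neg)


def average_precision(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Step-wise estimate of the area under the precision-recall curve.

    Assumption: ties in score are broken by input order.
    """
    n_pos = sum(y_true)
    if n_pos == 0:
        raise ValueError("at least one positive case is required")
    order = sorted(range(len(scores)), key=lambda k: -scores[k])
    true_pos = 0
    total = 0.0
    for rank, k in enumerate(order, start=1):
        if y_true[k] == 1:
            true_pos += 1
            total += true_pos / rank  # precision at each recalled positive
    return total / n_pos


# Hypothetical toy data: 1 = cancer case, 0 = non-case.
toy_labels = [0, 0, 1, 0, 1, 0, 0, 1]
toy_scores = [0.10, 0.40, 0.35, 0.20, 0.80, 0.30, 0.05, 0.70]
print(roc_auc(toy_labels, toy_scores))
print(average_precision(toy_labels, toy_scores))
```

A classifier that ranks cases at random has an expected AUC of 0.5, whereas its expected AUPRC is roughly the proportion of positive cases, so AUPRC values for rare outcomes are best read against the outcome prevalence rather than against 0.5.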
The one-class embedding classifier model achieved the highest area under the curve, with values of 0.868, 0.849, 0.798, 0.746, 0.800, 0.749, and 0.790 for liver, lung, colorectal, pancreatic, gastric, breast, and cervical cancers, respectively. The light gradient boosting models achieved the highest area under the precision-recall curve, with values of 0.383, 0.401, 0.387, 0.300, 0.385, 0.357, and 0.296 for liver, lung, colorectal, pancreatic, gastric, breast, and cervical cancers, respectively.
Our results show that readily applicable cancer prediction models can be developed from nationwide claims data using machine learning. The 7 models showed acceptable performance and explainability and thus can be distributed easily in real-world environments.