Guilin Medical University, Guilin, Guangxi, China.
West China Hospital, Chengdu, Sichuan, China.
Cancer Epidemiol Biomarkers Prev. 2023 Feb 6;32(2):274-280. doi: 10.1158/1055-9965.EPI-22-0792.
To expand nasopharyngeal carcinoma (NPC) screening to larger populations, more practical NPC risk prediction models independent of Epstein-Barr virus (EBV) and other lab tests are necessary.
Patient data before diagnosis of NPC were collected from hospital electronic medical records (EMR) and used to develop machine learning (ML) models for NPC risk prediction using XGBoost. NPC risk factor distributions were generated through connection delta ratio (CDR) analysis of patient graphs. By combining EMR-wide ML with patient graph analysis, the number of variables in these risk models was reduced, allowing for more practical NPC risk prediction ML models.
Using data collected from 1,357 patients with NPC and 1,448 control patients, an optimal set of 100 variables (ov100) was determined for building NPC risk prediction ML models, which achieved the following performance metrics: 0.93-0.96 recall, 0.80-0.92 precision, and 0.83-0.94 AUC. Aided by the analysis of top CDR-ranked risk factors, the models were further refined to contain only 20 practical variables (pv20), excluding EBV. The pv20 NPC risk XGBoost model achieved 0.79 recall, 0.94 precision, 0.96 specificity, and 0.87 AUC.
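For readers less familiar with the metrics reported above, the following is a minimal sketch of how recall, precision, specificity, and AUC are conventionally computed for a binary classifier. It is not the study's code. The function names, the 0.5 decision threshold, and the toy labels and scores are all illustrative assumptions and do not come from the study data.

```python
from typing import Sequence, Tuple


def confusion_counts(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[int, int, int, int]:
    """Return (tp, fp, tn, fn) for binary labels, where 1 = case and 0 = control."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn


def recall(tp: int, fn: int) -> float:
    """Fraction of true cases that the model flags (sensitivity)."""
    return tp / (tp + fn)


def precision(tp: int, fp: int) -> float:
    """Fraction of flagged patients who are true cases."""
    return tp / (tp + fp)


def specificity(tn: int, fp: int) -> float:
    """Fraction of controls that the model correctly leaves unflagged."""
    return tn / (tn + fp)


def auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the probability that a random case outscores a random control.

    Ties count as half a win (the Mann-Whitney formulation).
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical, NOT from the study): 4 cases and 4 controls.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.6, 0.4, 0.2, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold

tp, fp, tn, fn = confusion_counts(y_true, y_pred)
print("recall", recall(tp, fn))
print("precision", precision(tp, fp))
print("specificity", specificity(tn, fp))
print("AUC", auc(y_true, scores))
```

Recall, precision, and specificity depend on the chosen decision threshold, whereas AUC summarizes ranking quality across all thresholds. This is why a model can show high precision and specificity at one operating point while its recall is lower.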
This study demonstrated the feasibility of developing practical NPC risk prediction models using EMR-wide ML and patient graph CDR analysis, without requiring EBV data. These models could enable broader implementation of NPC risk evaluation and screening recommendations for larger populations in urban community health centers and rural clinics.
These more practical NPC risk models could help increase NPC screening rate and identify more patients with early-stage NPC.