Daalen Florian van, Henriksen Margrethe Høstgaard Bang, Hansen Torben Frøstrup, Jensen Lars Henrik, Brasen Claus Lohman, Hilberg Ole, Andersen Martin Ask Klausholt, Humerfelt Elise, Wee Leonard, Bermejo Inigo
Department of Radiation Oncology (MAASTRO), GROW School for Oncology and Reproduction, Maastricht University Medical Centre+, 6229 HX Maastricht, The Netherlands.
Department of Oncology, Vejle University Hospital, 7100 Vejle, Denmark.
Cancers (Basel). 2024 Nov 28;16(23):3989. doi: 10.3390/cancers16233989.
Background: Lung cancer (LC) is the leading cause of cancer mortality, making early diagnosis essential. While LC screening trials are underway globally, optimal prediction models and inclusion criteria are still lacking. This study aimed to develop and evaluate Bayesian Network (BN) models for LC risk prediction using a decade of data from Denmark. The primary goal was to assess BN performance on datasets varying in size and completeness, simulate real-world screening scenarios, and identify the most valuable data sources for LC screening.
Methods: The study included 38,944 patients evaluated for LC, of whom 11,284 (29%) were diagnosed with LC. Data on comorbidities, medications, and general practice were available for the entire cohort, while laboratory results, smoking habits, and other variables were available only for subsets. The cohort was divided into four subsets based on data availability, and BNs were trained and validated across these subsets using cross-validation and external validation. To determine the optimal combination of variables, all possible data combinations were evaluated on the samples that contained all the variables (n = 5587).
Results: A model trained on the small, complete dataset (AUC 0.78) performed similarly on a larger dataset with 21% missing data (AUC 0.78). Performance dropped (AUC 0.67) when 39% of the data were missing, which left informative variables entirely absent from the dataset. Laboratory results and smoking data were the most informative, significantly outperforming models based only on age and smoking status (AUC 0.70).
Conclusions: BN models demonstrated moderate to strong predictive performance, even with incomplete data, highlighting the potential value of incorporating laboratory results in LC screening programs.
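The exhaustive evaluation of data combinations described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names and the `score_fn` callback are hypothetical, no Bayesian network is fitted here, and AUC is computed with the generic rank-based (Mann-Whitney) formulation.

```python
from itertools import combinations


def auc(labels, scores):
    """Rank-based AUC: chance that a random positive case outscores a
    random negative case, with ties counted as one half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def all_source_combinations(sources):
    """Yield every non-empty subset of the data sources, smallest first."""
    for k in range(1, len(sources) + 1):
        yield from combinations(sources, k)


def rank_combinations(sources, labels, score_fn):
    """Evaluate each subset of sources and sort by AUC, best first.

    score_fn is a hypothetical callback: given a tuple of source names,
    it returns one predicted risk score per patient (for example, from a
    model trained only on those sources).
    """
    results = [
        (subset, auc(labels, score_fn(subset)))
        for subset in all_source_combinations(sources)
    ]
    return sorted(results, key=lambda item: item[1], reverse=True)
```

The number of subsets grows as 2^n - 1, so five data sources would already require 31 evaluations; restricting the search to complete cases, as the abstract describes, keeps every subset comparable on the same patients.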