Institute of Medical Biostatistics, Epidemiology and Informatics (IMBEI), University Medical Center, Johannes Gutenberg University, Mainz, Germany.
Institute of Genetic Epidemiology, Helmholtz Zentrum München-German Research Center for Environmental Health, Neuherberg, Germany.
PLoS One. 2023 Jan 26;18(1):e0280399. doi: 10.1371/journal.pone.0280399. eCollection 2023.
The low five-year survival rate of pancreatic ductal adenocarcinoma (PDAC) and the low rate at which imaging detects early-stage PDAC highlight the need to discover novel biomarkers and to improve current screening procedures for early diagnosis. Familial pancreatic cancer (FPC) refers to PDAC occurring in two or more first-degree relatives. Using high-throughput proteomics, we quantified the protein profiles of at-risk individuals from FPC families in different potential pre-cancer stages. However, the high-dimensional structure of proteomics data limits the use of traditional statistical analysis tools. We therefore applied advanced statistical learning methods to strengthen the analysis and improve the interpretability of the results.
We applied model-based gradient boosting and adaptive lasso to deal with the small, unbalanced study design via simultaneous variable selection and model fitting. In addition, we used stability selection to identify a stable subset of selected biomarkers and, as a result, obtain even more interpretable results. In each step, we compared the performance of the different analytical pipelines and validated our approaches via simulation scenarios.
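The combination of component-wise boosting and stability selection described above can be illustrated with a minimal sketch. This is not the authors' pipeline: the function names `componentwise_boost` and `stability_selection`, the logistic loss, the fixed intercept, the fixed number of boosting steps, the half-sample subsampling scheme, the 0.6 selection threshold, and the simulated data are all assumptions chosen for illustration.

```python
import math
import random

def componentwise_boost(X, y, n_steps=20, nu=0.1):
    """Component-wise gradient boosting for a 0/1 outcome (logistic loss).
    Each step updates only the single feature that best fits the negative
    gradient, so features that are never picked keep a coefficient of 0."""
    n, p = len(X), len(X[0])
    cols = [[row[j] for row in X] for j in range(p)]
    ss = [sum(v * v for v in col) or 1.0 for col in cols]
    p_bar = min(max(sum(y) / n, 1e-6), 1 - 1e-6)
    intercept = math.log(p_bar / (1 - p_bar))  # simplification: kept fixed
    coef = [0.0] * p
    f = [intercept] * n
    for _ in range(n_steps):
        # Negative gradient of the logistic loss: observed minus fitted probability.
        u = [yi - 1 / (1 + math.exp(-fi)) for yi, fi in zip(y, f)]
        # Least-squares slope of each single-feature base learner on u.
        betas = [sum(c * ui for c, ui in zip(col, u)) / s
                 for col, s in zip(cols, ss)]
        # Select the feature whose base learner reduces the residual sum of squares most.
        best = max(range(p), key=lambda j: betas[j] ** 2 * ss[j])
        coef[best] += nu * betas[best]
        f = [fi + nu * betas[best] * c for fi, c in zip(f, cols[best])]
    return intercept, coef

def stability_selection(X, y, n_subsamples=50, threshold=0.6, seed=0, **kw):
    """Refit the boosting model on random half-samples and keep the features
    selected in at least `threshold` of the refits."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    counts = [0] * p
    for _ in range(n_subsamples):
        idx = rng.sample(range(n), n // 2)
        _, coef = componentwise_boost([X[i] for i in idx],
                                      [y[i] for i in idx], **kw)
        for j, b in enumerate(coef):
            if b != 0.0:
                counts[j] += 1
    freq = [c / n_subsamples for c in counts]
    return [j for j, fr in enumerate(freq) if fr >= threshold], freq

# Simulated example (hypothetical data): 80 samples, 15 standard-normal
# features, of which only features 0 and 1 drive an unbalanced 0/1 outcome.
data_rng = random.Random(1)
X = [[data_rng.gauss(0, 1) for _ in range(15)] for _ in range(80)]
y = [1 if 3 * r[0] - 3 * r[1] + 0.5 * data_rng.gauss(0, 1) > 3 else 0 for r in X]
selected, freq = stability_selection(X, y)
print("stable features:", selected)
```

Capping the number of boosting steps keeps each refit sparse, which is what makes the selection frequencies informative; in a real analysis the stopping point and the threshold would need to be tuned rather than fixed as they are here.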
In the simulation study, model-based gradient boosting predicted more accurately than adaptive lasso in small, unbalanced, high-dimensional datasets and identified more of the relevant variables. Furthermore, using model-based gradient boosting, we discovered a subset of promising serum biomarkers that may improve the current screening procedure for FPC.
Advanced statistical learning methods helped us overcome the shortcomings of an unbalanced study design in a valuable clinical dataset. The discovered serum biomarkers give a clear direction for further investigation and support more precise clinical hypotheses about the development of FPC and optimal strategies for its early detection.