Evaluating key predictors of breast cancer through survival: a comparison of AFT frailty models with LASSO, ridge, and elastic net regularization.

Authors

Bosson-Amedenu Senyefia, Ayitey Emmanuel, Ayiah-Mensah Francis, Asare Luyton

Affiliation

Department of Mathematics, Statistics and Actuarial Science, Takoradi Technical University, Sekondi-Takoradi, Ghana.

Publication

BMC Cancer. 2025 Apr 11;25(1):665. doi: 10.1186/s12885-025-14040-z.

DOI:10.1186/s12885-025-14040-z

PMID:40217202

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC11987402/

Abstract

BACKGROUND

Frailty models are extensively utilized in survival analysis to address unobserved heterogeneity among individuals. However, selecting the most robust model for survival prediction, especially in the context of high-dimensional data, continues to pose a challenge. This study evaluates the performance of various Accelerated Failure Time (AFT) frailty models and examines the influence of regularization techniques, including LASSO, Ridge, and Elastic Net, on model selection and prediction accuracy.
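For orientation, the generic textbook form of a frailty AFT model and of the three penalties is given below. This is a sketch under assumed notation, not necessarily the exact specification used in the study: the symbols $T_{ij}$, $w_i$, $\sigma$, $\varepsilon_{ij}$, $\lambda$, and $\alpha$ are introduced here for illustration and do not appear in the abstract.

```latex
% AFT model with a frailty (random effect) term, on the log-time scale.
% T_{ij}: survival time of subject j in group i; x_{ij}: covariates;
% w_i: unobserved frailty shared within group i; \sigma: scale parameter.
\log T_{ij} = \mathbf{x}_{ij}^{\top}\boldsymbol{\beta} + w_i + \sigma\,\varepsilon_{ij}

% Penalized estimation (elastic net form); \ell is the log-likelihood.
% \alpha = 1 gives LASSO, \alpha = 0 gives ridge, 0 < \alpha < 1 gives elastic net.
\hat{\boldsymbol{\beta}} = \arg\min_{\boldsymbol{\beta}}
  \left\{ -\ell(\boldsymbol{\beta})
  + \lambda \left[ \alpha \lVert \boldsymbol{\beta} \rVert_1
  + \frac{1-\alpha}{2} \lVert \boldsymbol{\beta} \rVert_2^2 \right] \right\}
```

The distribution assumed for $\varepsilon_{ij}$ determines the model family: a standard extreme value error corresponds to a Weibull model, a logistic error to a log-logistic model, and a normal error to a log-normal model. Only the $\ell_1$ term can set coefficients exactly to zero, which is why LASSO can drop covariates while ridge only shrinks them.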

METHODS

We used simulated datasets and a real breast cancer dataset to compare seven Accelerated Failure Time (AFT) frailty models: Weibull, Log-logistic, Gamma, Gompertz, Log-normal, Generalized Gamma, and Extreme Value. Model performance was evaluated with the Akaike Information Criterion (AIC), Bayesian Information Criterion (BIC), Mean Absolute Error (MAE), and Mean Squared Error (MSE) at three sample sizes (25%, 50%, and 75%). To improve parameter estimation and reduce overfitting in high-dimensional survival data, we applied LASSO, Ridge, and Elastic Net regularization.

RESULTS

The Extreme Value Frailty AFT model outperformed all other models at every sample size, with the lowest AIC, BIC, MAE, and MSE, indicating the best fit and predictive accuracy. Its AIC ranged from 100.41 at the 25% sample size to 384.58 at the 75% sample size, consistently better than that of the second-best model, the Log-logistic. Forest plot analysis further confirmed the strong effect of the significant covariates. LASSO regularization also made the model more parsimonious by eliminating non-informative covariates, such as Age, PR, and Hospitalization, while retaining essential predictors such as Competing Risks, Metastasis, Stage, and Lymph Node involvement.
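The four comparison criteria can be sketched as follows. This is a minimal illustration using the standard definitions; the abstract does not say how MAE and MSE were computed for censored observations, so the sketch simply compares observed and predicted times. All numeric inputs are hypothetical and are not taken from the study.

```python
import math


def aic(log_lik, n_params):
    """Akaike Information Criterion: 2k - 2*logL (lower is better)."""
    return 2 * n_params - 2 * log_lik


def bic(log_lik, n_params, n_obs):
    """Bayesian Information Criterion: k*ln(n) - 2*logL (lower is better)."""
    return n_params * math.log(n_obs) - 2 * log_lik


def mae(observed, predicted):
    """Mean absolute error between observed and predicted survival times."""
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)


def mse(observed, predicted):
    """Mean squared error between observed and predicted survival times."""
    return sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed)


# Hypothetical fit summary and survival times (months), for illustration only.
log_lik, n_params, n_obs = -45.0, 5, 100
observed = [12.0, 30.0, 7.5, 48.0]
predicted = [10.0, 33.0, 9.5, 44.0]

print("AIC:", aic(log_lik, n_params))
print("BIC:", bic(log_lik, n_params, n_obs))
print("MAE:", mae(observed, predicted))
print("MSE:", mse(observed, predicted))
```

AIC and BIC both grow with the number of observations entering the log-likelihood, so values from different sample sizes are not comparable with each other. Comparisons are meaningful only between models fitted to the same data.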

CONCLUSION

The Extreme Value Frailty AFT model demonstrated strong predictive performance in survival analysis, particularly when combined with LASSO regularization to enhance interpretability and generalizability. Key predictors, including Comorbidity, Metastasis, Stage, and Lymph Node involvement, remained significant after regularization, with reduced coefficients. Notably, patients without metastasis had 2.63 times longer expected survival than those with metastatic disease, while lower-stage diagnoses and minimal lymph node involvement contributed to 26% and 16% longer survival times, respectively. Other significant factors included recurrence status (19% increase in survival), HER2 negativity (20% longer survival), absence of the Triple Negative subtype (15% longer survival), and lower tumor grades (11% longer survival). By shrinking less relevant variables, LASSO mitigated overfitting while preserving critical predictors, reinforcing the importance of tumor characteristics and molecular markers in survival outcomes.

The study highlights the role of risk stratification: patients categorized into Low, Medium, and High-risk groups exhibit distinct survival patterns, consistent with the Extreme Value AFT Frailty Model. Forest plot analysis further confirms the strong effect of the significant covariates, with Competing Risks, Lymph Node Involvement, and Metastasis emerging as the most critical prognostic factors. Kaplan-Meier survival analysis reveals sharp survival declines associated with metastasis, lymph node involvement, tumor grade, HER2 status, and molecular subtypes, reinforcing the need for early detection and targeted interventions. Patients with Triple Negative and HER2-overexpressing subtypes exhibit the poorest survival outcomes, highlighting the necessity for subtype-specific therapies. Competing risks, particularly hospitalization-related factors, also substantially affect survival, pointing to the need for integrated treatment approaches. These findings underscore the role of advanced statistical techniques in improving survival predictions and can inform clinical decision-making in breast cancer prognosis and broader medical research.
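The survival-time effects quoted above can be read as AFT time ratios. In an AFT model a coefficient acts on the log-time scale, so exponentiating it gives a multiplicative effect on expected survival time. The sketch below assumes the reported figures (2.63 times, 26%, 16%) are such time ratios; the abstract does not report the coefficients themselves, and the function names are hypothetical.

```python
import math


def time_ratio(beta):
    """Convert an AFT coefficient (log-time scale) to a time ratio exp(beta)."""
    return math.exp(beta)


def coefficient_from_ratio(ratio):
    """Recover the log-time coefficient implied by a time ratio."""
    return math.log(ratio)


def percent_change(ratio):
    """Express a time ratio as a percent change in survival time."""
    return (ratio - 1.0) * 100.0


# Time ratios as read from the abstract, under the assumption stated above.
reported = {
    "no metastasis vs. metastasis": 2.63,
    "lower stage": 1.26,
    "minimal lymph node involvement": 1.16,
}

for label, ratio in reported.items():
    beta = coefficient_from_ratio(ratio)
    print(f"{label}: ratio={ratio:.2f}, implied beta={beta:.3f}, "
          f"change={percent_change(ratio):.0f}%")
```

On this reading, a coefficient shrunk toward zero by regularization corresponds to a time ratio closer to 1, that is, a smaller estimated effect on survival time.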
