Department of Statistics and Genetics Institute, University of Florida, Gainesville, Florida, USA.
Department of Statistics, University of Florida, Gainesville, Florida, USA.
Biometrics. 2023 Sep;79(3):1610-1623. doi: 10.1111/biom.13736. Epub 2022 Oct 17.
We propose a constrained maximum partial likelihood estimator for dimension reduction in integrative (e.g., pan-cancer) survival analysis with high-dimensional predictors. We assume that for each population in the study, the hazard function follows a distinct Cox proportional hazards model. To borrow information across populations, we assume that each of the hazard functions depends only on a small number of linear combinations of the predictors (i.e., "factors"). We estimate these linear combinations using an algorithm based on "distance-to-set" penalties. This allows us to impose both low-rankness and sparsity on the regression coefficient matrix estimator. We derive asymptotic results showing that our estimator is more efficient than fitting a separate proportional hazards model for each population. Numerical experiments suggest that our method outperforms competitors under various data generating models. We use our method to perform a pan-cancer survival analysis relating protein expression to survival across 18 distinct cancer types. Our approach identifies six linear combinations, depending on only 20 proteins, which explain survival across the cancer types. Finally, to validate our fitted model, we show that our estimated factors can lead to better prediction than competitors on four external datasets.
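To make the idea of a "distance-to-set" penalty concrete, the following is a minimal sketch in Python and is not the authors' implementation. It assumes the regression coefficients are stored as a p x J matrix `B` (predictors by populations), that low-rankness means rank at most `r`, that sparsity means at most `s` nonzero rows, and that each constraint is penalized separately by half the squared Frobenius distance to its set. The abstract does not state these details, so the function names and the exact penalty form are hypothetical.

```python
import numpy as np


def project_low_rank(B, r):
    """Closest matrix to B (Frobenius norm) with rank at most r, via truncated SVD."""
    U, sing, Vt = np.linalg.svd(B, full_matrices=False)
    sing[r:] = 0.0  # discard all but the r largest singular values
    return (U * sing) @ Vt


def project_row_sparse(B, s):
    """Closest matrix to B (Frobenius norm) with at most s nonzero rows."""
    P = np.zeros_like(B)
    if s <= 0:
        return P
    row_norms = np.linalg.norm(B, axis=1)
    keep = np.argsort(row_norms)[-s:]  # indices of the s rows with largest norm
    P[keep] = B[keep]
    return P


def distance_penalty(B, projected):
    """Half the squared Frobenius distance between B and its projection onto a set."""
    return 0.5 * np.linalg.norm(B - projected, "fro") ** 2


# Simulated illustration only: 50 predictors, 18 populations,
# 6 factors that involve only the first 20 predictors.
rng = np.random.default_rng(0)
A = np.zeros((50, 6))
A[:20] = rng.normal(size=(20, 6))   # loadings of the 20 active predictors
C = rng.normal(size=(6, 18))        # population-specific factor coefficients
B = A @ C                           # low-rank and row-sparse by construction

B_noisy = B + 0.1 * rng.normal(size=B.shape)

for name, M in [("structured", B), ("noisy", B_noisy)]:
    rank_pen = distance_penalty(M, project_low_rank(M, 6))
    sparse_pen = distance_penalty(M, project_row_sparse(M, 20))
    print(f"{name}: rank penalty = {rank_pen:.3e}, sparsity penalty = {sparse_pen:.3e}")
```

Under these assumptions, both penalties vanish (up to rounding) for a coefficient matrix that already satisfies the constraints and grow as the matrix moves away from the constraint sets. In an estimator of the kind the abstract describes, such penalties would be added to the negative log partial likelihood summed over populations; that optimization step is omitted here.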