An online framework for survival analysis: reframing Cox proportional hazards model for large data sets and neural networks.

Affiliation

Department of Biostatistics, Hans Rosling Center for Population Health, Box 351617, University of Washington, Seattle, WA 98195, USA.

Publication information

Biostatistics. 2023 Dec 15;25(1):134-153. doi: 10.1093/biostatistics/kxac039.

DOI:10.1093/biostatistics/kxac039

PMID:36288541

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC10724274/

Abstract

In many biomedical applications, outcome is measured as a "time-to-event" (e.g., disease progression or death). To assess the connection between features of a patient and this outcome, it is common to assume a proportional hazards model and fit a proportional hazards regression (or Cox regression). To fit this model, a log-concave objective function known as the "partial likelihood" is maximized. For moderate-sized data sets, an efficient Newton-Raphson algorithm that leverages the structure of the objective function can be employed. However, in large data sets this approach has two issues: (i) The computational tricks that leverage structure can also lead to computational instability; (ii) The objective function does not naturally decouple: Thus, if the data set does not fit in memory, the model can be computationally expensive to fit. This additionally means that the objective is not directly amenable to stochastic gradient-based optimization methods. To overcome these issues, we propose a simple, new framing of proportional hazards regression: This results in an objective function that is amenable to stochastic gradient descent. We show that this simple modification allows us to efficiently fit survival models with very large data sets. This also facilitates training complex, for example, neural-network-based, models with survival data.
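The abstract does not spell out the reformulated objective, so the following is only a minimal sketch of one way a partial-likelihood objective could be made amenable to stochastic gradient descent. It assumes that the partial likelihood is evaluated on small random subsets of patients (called "strata" here) rather than on the full data set. It further assumes a linear predictor, no tied event times, and a fixed step size. All function and parameter names (`neg_log_partial_likelihood`, `grad_neg_log_pl`, `sgd_fit`, `stratum_size`, `lr`) are hypothetical and are not taken from the paper.

```python
import numpy as np


def neg_log_partial_likelihood(beta, X, time, event):
    """Negative log partial likelihood of a linear Cox model.

    Assumes no tied event times. `event` is 1 for an observed event
    and 0 for a censored observation.
    """
    order = np.argsort(-time, kind="stable")      # sort by descending time
    eta = X[order] @ beta                          # linear predictor
    ev = event[order]
    # After sorting, the risk set of row i is rows 0..i, so a running
    # log-sum-exp gives log(sum_{j in risk set} exp(eta_j)) stably.
    log_risk = np.logaddexp.accumulate(eta)
    return -np.sum(ev * (eta - log_risk))


def grad_neg_log_pl(beta, X, time, event):
    """Gradient of neg_log_partial_likelihood with respect to beta."""
    order = np.argsort(-time, kind="stable")
    Xs, ev = X[order], event[order]
    eta = Xs @ beta
    w = np.exp(eta - eta.max())                    # shift for stability
    # Risk-set weighted mean of covariates, via cumulative sums.
    weighted_mean = np.cumsum(w[:, None] * Xs, axis=0) / np.cumsum(w)[:, None]
    return -np.sum(ev[:, None] * (Xs - weighted_mean), axis=0)


def sgd_fit(X, time, event, stratum_size=8, n_steps=3000, lr=0.05, seed=0):
    """Plain SGD where each step uses the partial likelihood of one
    randomly drawn stratum (an assumption of this sketch)."""
    rng = np.random.default_rng(seed)
    beta = np.zeros(X.shape[1])
    n = X.shape[0]
    for _ in range(n_steps):
        idx = rng.choice(n, size=stratum_size, replace=False)
        beta -= lr * grad_neg_log_pl(beta, X[idx], time[idx], event[idx])
    return beta


if __name__ == "__main__":
    # Simulated data (illustrative only): exponential event times whose
    # hazard is proportional to exp(X @ beta_true), with no censoring.
    rng = np.random.default_rng(1)
    beta_true = np.array([1.0, -1.0])
    X = rng.normal(size=(500, 2))
    time = rng.exponential(1.0 / np.exp(X @ beta_true))
    event = np.ones(500)
    print(sgd_fit(X, time, event))
```

Each update touches only `stratum_size` rows, so the data never need to be held in memory all at once. The full-data partial likelihood, by contrast, couples every patient to everyone still at risk at that patient's event time.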
