Institute of Medical Biometry and Statistics, Faculty of Medicine and Medical Center - University of Freiburg, Germany.
Brief Bioinform. 2022 Jan 17;23(1). doi: 10.1093/bib/bbab392.
Deep neural networks are frequently used to predict survival conditional on omics-type biomarkers, e.g., by using the partial likelihood of the Cox proportional hazards model as the loss function. Because the number of observations in clinical studies is generally limited, combining different data sets has been proposed to improve learning of network parameters. However, if baseline hazards differ between the studies, the assumptions of the Cox proportional hazards model are violated. Based on high-dimensional transcriptome profiles from different tumor entities, we demonstrate how using a stratified partial likelihood as the loss function accounts for the different baseline hazards in a deep learning framework. Additionally, we compare the partial likelihood with the ranking loss, which is frequently used as a loss function in machine learning approaches because of its apparent simplicity. Using RNA-seq data from The Cancer Genome Atlas (TCGA), we show that stratified loss functions lead to overall better discriminatory power and lower prediction error than their non-stratified counterparts. We investigate which genes are identified as having the greatest marginal impact on survival prediction under the different loss functions. We find that, while similar genes are identified, known prognostic genes in particular receive higher importance from stratified loss functions. Taken together, pooling data from different sources to improve parameter learning of deep neural networks benefits greatly from stratified loss functions that account for potentially varying baseline hazards. For easy application, we provide PyTorch code for stratified loss functions and an explanatory Jupyter notebook in a GitHub repository.
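To make the idea of stratification concrete, the following is a minimal, framework-free sketch of a stratified negative log partial likelihood. It is not the authors' PyTorch code from their repository. The function names, the Breslow-style handling of tied event times, and the normalization by the number of events are all assumptions made for illustration. The key point is that each risk set is formed only from subjects in the same stratum (e.g., the same study or tumor entity), so each stratum keeps its own baseline hazard.

```python
import math
from collections import defaultdict


def neg_log_partial_likelihood(risk, time, event):
    """Negative Cox log partial likelihood, summed over observed events.

    Hypothetical helper for illustration; ties are handled Breslow-style.
    risk  : predicted log-risk scores (e.g., network outputs)
    time  : observed follow-up times
    event : 1 if the event was observed, 0 if censored
    """
    total = 0.0
    for i in range(len(risk)):
        if not event[i]:
            continue  # censored subjects contribute only through risk sets
        # Risk set: everyone still under observation at time[i].
        at_risk = [risk[j] for j in range(len(risk)) if time[j] >= time[i]]
        # Numerically stable log-sum-exp over the risk set.
        m = max(at_risk)
        log_denominator = m + math.log(sum(math.exp(r - m) for r in at_risk))
        total -= risk[i] - log_denominator
    return total


def stratified_neg_log_partial_likelihood(risk, time, event, stratum):
    """Sum the per-stratum losses, then average over all observed events.

    Hypothetical name; the normalization by event count is an assumption.
    """
    groups = defaultdict(list)
    for idx, label in enumerate(stratum):
        groups[label].append(idx)

    total = 0.0
    for members in groups.values():
        # Risk sets are built within one stratum only.
        total += neg_log_partial_likelihood(
            [risk[k] for k in members],
            [time[k] for k in members],
            [event[k] for k in members],
        )
    n_events = sum(1 for e in event if e)
    return total / n_events if n_events else 0.0


if __name__ == "__main__":
    # Two made-up cohorts whose follow-up times are on very different scales.
    risk = [0.5, -0.5, 1.0, 0.0]
    time = [2, 4, 20, 40]
    event = [1, 1, 1, 0]
    stratum = ["A", "A", "B", "B"]
    print(stratified_neg_log_partial_likelihood(risk, time, event, stratum))
```

In a deep learning setting the same computation would be written with differentiable tensor operations so that gradients flow back to the network producing `risk`. The pure-Python version above is only meant to show how the risk sets are restricted to each stratum.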