Rollo Cesare, Pancotti Corrado, Sartori Flavio, Caranzano Isabella, D'Amico Saverio, Carota Luciana, Casadei Francesco, Birolo Giovanni, Lanino Luca, Sauta Elisabetta, Asti Gianluca, Buizza Alessandro, Delleani Mattia, Zazzetti Elena, Bicchieri Marilena, Maggioni Giulia, Fenaux Pierre, Platzbecker Uwe, Diez-Campelo Maria, Haferlach Torsten, Castellani Gastone, Della Porta Matteo Giovanni, Fariselli Piero, Sanavia Tiziana
Computational Biomedicine Unit, Department of Medical Sciences, University of Torino, Via Santena 19, 10126, Torino, Italy.
IRCCS Humanitas Research Hospital, via Manzoni 56, 20089 Rozzano - Milan, Italy; Train s.r.l., via Alessandro Manzoni 56, 20089 Rozzano - Milan, Italy.
Comput Methods Programs Biomed. 2025 Apr;261:108605. doi: 10.1016/j.cmpb.2025.108605. Epub 2025 Jan 20.
Several computational pipelines for biomedical data have been proposed to stratify patients and to predict their prognosis through survival analysis. However, these two analyses are usually performed independently, without integrating the information each one provides. Clustering of survival data remains an underexplored problem, and current approaches are of limited use in biomedical applications, where data are typically heterogeneous and multimodal, because they scale poorly to high-dimensional inputs.
We introduce VAE-Surv, a multimodal computational framework for patient stratification and prognosis prediction. VAE-Surv integrates a Variational Autoencoder (VAE), which reduces the high-dimensional space characterizing the molecular data, with a deep survival model, which combines the embedded information with the clinical features. The VAE embedding step prioritizes local coherence within the feature space to detect potential nonlinear relationships among the molecular markers. The latent representation is then used to perform K-means clustering. To test the clinical robustness of the algorithm, VAE-Surv was applied to the GenoMed4All cohort of Myelodysplastic Syndromes (MDS), and the identified subtypes were compared with the World Health Organization (WHO) classification. Survival prediction performance was compared with that of the state-of-the-art Cox model and its penalized versions. Finally, to assess the generalizability of the results, the method was also validated on an external MDS cohort.
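The clustering step described above can be illustrated with a minimal sketch. It assumes the VAE encoder has already produced one latent vector per patient; the toy `latent` values, the function name `kmeans`, and the deterministic farthest-point initialization are illustrative assumptions and are not taken from the VAE-Surv implementation.

```python
import math

def kmeans(points, k, n_iter=100):
    """Plain Lloyd's K-means on a list of equal-length numeric vectors.

    Initialization (an assumption made for reproducibility): the first
    centroid is points[0]; each further centroid is the point farthest
    from the centroids chosen so far.
    """
    centroids = [list(points[0])]
    while len(centroids) < k:
        far = max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        centroids.append(list(far))

    labels = None
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the run has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

# Hypothetical 2-D latent embeddings for six patients (toy values).
latent = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2),
          (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
labels, centroids = kmeans(latent, k=2)
print(labels)
```

In practice the latent vectors would come from the trained encoder, and the number of clusters would be chosen with a criterion appropriate to the cohort; neither choice is specified in this abstract.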
Tested on 2,043 patients in the GenoMed4All cohort, VAE-Surv achieved a median C-index of 0.78, outperforming classical approaches. In addition, clustering in the latent space performed better than a traditional approach that applies clustering directly to the input data. Comparison with the WHO 2016 MDS subtypes showed that the proposed framework can capture existing clinical categorizations while also suggesting novel, data-driven patient groups. When tested on an external MDS cohort of 2,384 patients, VAE-Surv still achieved good prediction performance (median C-index = 0.74) while preserving the interpretability of the main clinical and genetic features.
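For readers unfamiliar with the metric reported above, the following is a minimal sketch of Harrell's concordance index (C-index) for right-censored data. The function name, the toy inputs, and the simplified handling of tied event times (such pairs are skipped) are assumptions made for illustration; this is not the evaluation code used in the study.

```python
from itertools import combinations

def concordance_index(times, events, risks):
    """Harrell's C-index.

    times  : observed follow-up times
    events : 1 if the event was observed, 0 if the patient was censored
    risks  : predicted risk scores (higher = shorter expected survival)
    """
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        # Let `a` be the patient with the shorter observed time.
        a, b = (i, j) if times[i] < times[j] else (j, i)
        # A pair is comparable only if the shorter time is an observed event.
        # Pairs with tied times are skipped (a simplification).
        if times[a] == times[b] or not events[a]:
            continue
        comparable += 1
        if risks[a] > risks[b]:
            concordant += 1.0   # higher risk had the earlier event
        elif risks[a] == risks[b]:
            concordant += 0.5   # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable

# Toy example: four patients, the second one censored at t = 10.
times = [5, 10, 12, 20]
events = [1, 0, 1, 1]
risks = [0.9, 0.4, 0.1, 0.6]
print(concordance_index(times, events, risks))  # 3 of 4 comparable pairs agree -> 0.75
```

A value of 0.5 corresponds to random ordering and 1.0 to perfect ranking of patients by risk.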
VAE-Surv enables automatic identification of patient clusters while also outperforming the traditional CoxPH model in survival prediction. Applied to the MDS use case, the resulting genetics-based clusters exhibit a clear survival stratification, and including the clinical information yielded high performance in prognosis prediction.