School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, 510000, China.
Sun Yat-sen Memorial Hospital, Sun Yat-sen University, Guangzhou, 510000, China.
Comput Biol Med. 2021 Jul;134:104481. doi: 10.1016/j.compbiomed.2021.104481. Epub 2021 May 9.
BACKGROUND: Genomic information is now widely used to guide precision cancer treatment. Because any single type of omics data offers only one view and is subject to noise and bias, accurate cancer prognosis prediction requires multiple types of omics data. Integrating multi-omics data effectively is challenging, however, because the number of redundant variables is large while the sample size is relatively small. With recent progress in deep learning, autoencoders have been used to integrate multi-omics data and extract representative features, but the resulting models are sensitive to data noise. In addition, previous studies usually focused on individual cancer types without comprehensive pan-cancer testing. Here, we employed a denoising autoencoder to obtain a robust representation of the multi-omics data, and then used the learned features to estimate patients' risks.
RESULTS: Applied to 15 cancers from The Cancer Genome Atlas (TCGA), our method improved C-index values over previous methods by 6.5% on average. Because multi-omics data are difficult to obtain in practice, we further trained XGBoost models to fit the estimated risks from mRNA data alone; these models achieved an average C-index of 0.627. As a case study, the breast cancer prognosis prediction model was independently tested on three datasets from the Gene Expression Omnibus (GEO) and was shown to significantly separate high-risk patients from low-risk ones (C-index > 0.6, p-values < 0.05). Based on the risk subgroups defined by our method, we identified nine prognostic markers highly associated with breast cancer, seven of which are supported by the literature.
CONCLUSION: Our comprehensive tests indicate that we have constructed an accurate and robust framework for integrating multi-omics data for cancer prognosis prediction. The framework is also an effective way to discover genes related to cancer prognosis.
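The C-index values quoted in the results measure how often a model ranks a patient who died earlier as higher risk than a patient who survived longer. The following is a minimal sketch, assuming Harrell's definition of the concordance index; the abstract does not state which variant the authors used, and the function name `concordance_index` and the toy data are illustrative only.

```python
from typing import Sequence


def concordance_index(
    times: Sequence[float],
    events: Sequence[int],
    risks: Sequence[float],
) -> float:
    """Harrell's C-index for right-censored survival data (illustrative sketch).

    times  -- observed follow-up time per patient
    events -- 1 if the event (e.g. death) was observed, 0 if censored
    risks  -- predicted risk score; higher means worse expected prognosis
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            # A censored patient cannot be the earlier member of a pair.
            continue
        for j in range(n):
            # Pair is comparable when patient i had the event strictly
            # before patient j's observed time.
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5  # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs in the data")
    return concordant / comparable


# Toy data (hypothetical): four patients, the third is censored.
toy_times = [5.0, 8.0, 12.0, 20.0]
toy_events = [1, 1, 0, 1]
toy_risks = [0.9, 0.4, 0.6, 0.1]
toy_c_index = concordance_index(toy_times, toy_events, toy_risks)
print(f"C-index on toy data: {toy_c_index:.3f}")
```

Under this definition a value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which is the scale on which the averages reported in the results are read.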