Integrating multi-omics data through deep learning for accurate cancer prognosis prediction.

Author information

School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, 510000, China.

Sun Yat-sen Memorial Hospital, Sun Yat-sen University, Guangzhou, 510000, China.

Publication information

Comput Biol Med. 2021 Jul;134:104481. doi: 10.1016/j.compbiomed.2021.104481. Epub 2021 May 9.

DOI:10.1016/j.compbiomed.2021.104481

PMID:33989895

Abstract

BACKGROUND

Genomic information is now widely used to guide precision cancer treatment. Because any single type of omics data represents only one view and suffers from noise and bias, accurate cancer prognosis prediction requires multiple types of omics data. However, integrating multi-omics data effectively is challenging because of the large number of redundant variables and the relatively small sample sizes. With recent progress in deep learning, autoencoders have been used to integrate multi-omics data and extract representative features. Nevertheless, the resulting models are sensitive to data noise. In addition, previous studies usually focused on individual cancer types and did not test comprehensively across cancers (pan-cancer). Here, we employed a denoising autoencoder to obtain a robust representation of the multi-omics data and then used the learned features to estimate patients' risks.
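As a rough illustration of the denoising idea described above, the following minimal NumPy sketch trains a one-hidden-layer autoencoder on corrupted inputs while scoring the reconstruction against the clean inputs. It is not the authors' implementation: the abstract does not specify the architecture, so the masking noise, the noise rate, the tanh activation, the layer size, the learning rate, and the synthetic matrix `X` (standing in for multi-omics features already concatenated column-wise) are all assumptions made for this example.

```python
import numpy as np


def init_params(n_features, n_hidden, rng):
    """Small random weights for a one-hidden-layer autoencoder."""
    return {
        "W1": rng.normal(0.0, 0.1, (n_features, n_hidden)),
        "b1": np.zeros(n_hidden),
        "W2": rng.normal(0.0, 0.1, (n_hidden, n_features)),
        "b2": np.zeros(n_features),
    }


def encode(X, params):
    """Map samples to the low-dimensional learned representation."""
    return np.tanh(X @ params["W1"] + params["b1"])


def reconstruction_error(X, params):
    """Mean squared error between clean inputs and their reconstructions."""
    X_hat = encode(X, params) @ params["W2"] + params["b2"]
    return float(np.mean((X_hat - X) ** 2))


def train_denoising_autoencoder(X, params, rng, noise_rate=0.2, lr=0.05, epochs=500):
    """Full-batch gradient descent: inputs are corrupted, targets stay clean."""
    for _ in range(epochs):
        # Masking noise (an assumption): zero out roughly noise_rate of the entries.
        keep = rng.random(X.shape) >= noise_rate
        X_noisy = X * keep
        H = np.tanh(X_noisy @ params["W1"] + params["b1"])
        err = H @ params["W2"] + params["b2"] - X          # compare with the CLEAN input
        g_out = 2.0 * err / err.size                       # gradient of the mean squared error
        g_hid = (g_out @ params["W2"].T) * (1.0 - H ** 2)  # backpropagate through tanh
        params["W2"] -= lr * (H.T @ g_out)
        params["b2"] -= lr * g_out.sum(axis=0)
        params["W1"] -= lr * (X_noisy.T @ g_hid)
        params["b1"] -= lr * g_hid.sum(axis=0)
    return params


# Synthetic stand-in data (hypothetical): 200 samples, 12 features, 2 latent factors.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
X = latent @ rng.normal(size=(2, 12)) + 0.1 * rng.normal(size=(200, 12))
X = (X - X.mean(axis=0)) / X.std(axis=0)  # z-score each feature

params = init_params(X.shape[1], 4, rng)
error_before = reconstruction_error(X, params)
train_denoising_autoencoder(X, params, rng)
error_after = reconstruction_error(X, params)
features = encode(X, params)  # compact representation for a downstream risk model
print(error_before, error_after, features.shape)
```

In this sketch the rows of `features` play the role of the learned representative features; how those features are turned into patient risk estimates is not described in the abstract and is left out here.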

RESULTS

Applied to 15 cancers from The Cancer Genome Atlas (TCGA), our method improved C-index values over previous methods by 6.5% on average. Because multi-omics data are difficult to obtain in practice, we further trained XGBoost models to fit the estimated risks using only mRNA data; these models achieved an average C-index of 0.627. As a case study, the breast cancer prognosis prediction model was independently tested on three datasets from the Gene Expression Omnibus (GEO) and significantly separated high-risk patients from low-risk ones (C-index > 0.6, p-values < 0.05). Based on the risk subgroups defined by our method, we identified nine prognostic markers highly associated with breast cancer, seven of which have been confirmed by a literature review.
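For readers unfamiliar with the C-index reported above, the sketch below computes Harrell's concordance index in plain Python: among comparable patient pairs, it is the fraction in which the patient with the shorter observed survival time received the higher predicted risk. The toy survival times, event flags, and risk scores are invented for illustration, pairs with tied times are simply skipped, and the median cutoff used to form high-risk and low-risk groups is an assumption, since the abstract does not state how the subgroups were divided.

```python
from itertools import combinations
from statistics import median


def concordance_index(times, events, risks):
    """Harrell's C-index for right-censored survival data.

    times:  observed follow-up times
    events: 1 if the event (e.g. death) was observed, 0 if censored
    risks:  predicted risk scores (higher means worse prognosis)
    """
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        if times[i] == times[j]:
            continue  # simplification: pairs with tied times are skipped
        first, second = (i, j) if times[i] < times[j] else (j, i)
        if not events[first]:
            continue  # the earlier time is censored, so the pair is not comparable
        comparable += 1
        if risks[first] > risks[second]:
            concordant += 1.0
        elif risks[first] == risks[second]:
            concordant += 0.5  # tied risk scores count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


def split_by_median(risks):
    """Label each patient 'high' or 'low' risk using the median score as cutoff."""
    cutoff = median(risks)
    return ["high" if r > cutoff else "low" for r in risks]


# Toy data (hypothetical): six patients, two of them censored.
times = [5, 8, 12, 20, 25, 30]
events = [1, 1, 0, 1, 0, 1]
risks = [0.9, 0.7, 0.8, 0.4, 0.3, 0.1]

c_index = concordance_index(times, events, risks)
groups = split_by_median(risks)
print(round(c_index, 3), groups)
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which gives context for the reported averages. Established survival-analysis packages handle tied times and significance testing more carefully than this simplified version.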

CONCLUSION

Our comprehensive tests indicate that we have constructed an accurate and robust framework for integrating multi-omics data for cancer prognosis prediction. The framework is also an effective way to discover genes related to cancer prognosis.
