Liu Qian, Hu Pingzhao
Department of Biochemistry and Medical Genetics, College of Medicine, Faculty of Health Sciences, University of Manitoba, Winnipeg, MB R3E 0J9, Canada.
Research Institute in Oncology and Hematology, CancerCare Manitoba, Winnipeg, MB R3E 0V9, Canada.
Cancers (Basel). 2019 Apr 7;11(4):494. doi: 10.3390/cancers11040494.
Artificial intelligence-based unsupervised deep learning (DL) is widely used to mine multimodal big data, but it has rarely been applied to cancer genomics. We aim to develop DL models that extract deep features from breast cancer gene expression data and copy number alteration (CNA) data, separately and jointly. We hypothesize that these deep features are associated with patients' clinical characteristics and outcomes. Two unsupervised denoising autoencoders (DAs) were developed to extract deep features from TCGA (The Cancer Genome Atlas) breast cancer gene expression and CNA data separately and jointly. A heat map was used to view and cluster patients into subgroups based on these DL features. Fisher's exact test and Pearson's chi-square test were applied to test the associations between patient groups and clinical information. Survival differences between the groups were evaluated using Kaplan-Meier (KM) curves. Associations between each feature and patients' overall survival were assessed using Cox's proportional hazards (COX-PH) model, and a risk score for each feature set from the different omics data sets was generated from the survival regression coefficients. The risk scores for each feature set were binarized into high- and low-risk patient groups to evaluate survival differences using KM curves. Furthermore, the risk scores were traced back to their gene-level DA weights so that three gene lists corresponding to the genomic data sets were generated to perform gene set enrichment analysis. Patients were clustered into two groups based on concatenated features from the gene expression and CNA data, and these two groups showed different overall survival rates (p-value = 0.049) and different ER (estrogen receptor) statuses (p-value = 0.002, OR (odds ratio) = 0.626). All the risk scores from the gene expression data, the CNA data, and their concatenation were significantly associated with breast cancer survival.
Patients in the high-risk group had significantly worse outcomes (p-values ≤ 0.0023). The concatenated risk score was enriched in the AMP-activated protein kinase (AMPK) signaling pathway, the regulation of DNA-templated transcription, the regulation of nucleic acid-templated transcription, the regulation of apoptotic process, the positive regulation of gene expression, the positive regulation of cell proliferation, heart morphogenesis, and the regulation of cellular macromolecule biosynthetic process, with FDR (false discovery rate) less than 0.05. We confirmed that DAs can effectively extract meaningful genomic features from genomic data and that concatenating multiple data sources can improve the significance of the features associated with breast cancer patients' clinical characteristics and outcomes.
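The risk-score step described in the abstract (a score built from survival regression coefficients, binarized into high- and low-risk groups, then compared with KM curves) can be illustrated with a minimal pure-Python sketch. Everything concrete here is an assumption made for illustration: the abstract does not state the binarization cutoff (the median is used below), the coefficients are toy values rather than fitted Cox estimates, and the function names `risk_scores`, `binarize`, and `kaplan_meier` are hypothetical.

```python
from statistics import median

def risk_scores(features, coefs):
    """Linear risk score per patient: sum over features of coef * value."""
    return [sum(c * x for c, x in zip(coefs, row)) for row in features]

def binarize(scores):
    """Split at the median (an assumed cutoff): True marks high risk."""
    cut = median(scores)
    return [s > cut for s in scores]

def kaplan_meier(times, events):
    """Kaplan-Meier estimate as [(time, survival)] at each death time.

    events[i] is 1 if the patient died at times[i], 0 if censored.
    """
    at_risk = len(times)
    surv = 1.0
    curve = []
    for t in sorted(set(times)):
        deaths = sum(1 for ti, e in zip(times, events) if ti == t and e == 1)
        leaving = sum(1 for ti in times if ti == t)
        if deaths:
            surv *= 1 - deaths / at_risk
            curve.append((t, surv))
        at_risk -= leaving  # deaths and censored patients leave the risk set
    return curve

# Toy data: 6 patients, 2 deep features each (not real TCGA values).
features = [[0.2, 1.0], [0.9, 0.1], [0.5, 0.5],
            [0.1, 0.8], [0.8, 0.3], [0.7, 0.9]]
coefs = [1.5, -0.5]                 # assumed regression coefficients
times = [60, 12, 48, 72, 20, 30]    # follow-up in months
events = [0, 1, 1, 0, 1, 0]         # 1 = death, 0 = censored

scores = risk_scores(features, coefs)
high = binarize(scores)

def group(flag):
    """Times and events for patients whose high-risk flag equals `flag`."""
    idx = [i for i, h in enumerate(high) if h == flag]
    return [times[i] for i in idx], [events[i] for i in idx]

km_high = kaplan_meier(*group(True))
km_low = kaplan_meier(*group(False))
print("high-risk KM:", km_high)
print("low-risk KM:", km_low)
```

In practice the coefficients would come from a fitted Cox model, and the difference between the two curves would be tested formally (for example with a log-rank test) rather than inspected by eye.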