Wang Shuo, Zhang Hao, Liu Zhen, Liu Yuanning
College of Computer Science and Technology, Jilin University, Changchun, China.
Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun, China.
Front Genet. 2022 Mar 14;13:800853. doi: 10.3389/fgene.2022.800853. eCollection 2022.
Lung cancer is the leading cause of cancer deaths, so predicting the survival status of lung cancer patients is of great value. However, existing methods depend mainly on statistical machine learning (ML) algorithms, which are poorly suited to high-dimensional genomics data. Deep learning (DL), with its strong capacity for learning from high-dimensional data, can instead be used to predict lung cancer survival from genomics data. The Cancer Genome Atlas (TCGA) is a large database that contains many kinds of genomics data for 33 cancer types. With this enormous amount of data, researchers can analyze key factors related to cancer therapy. This paper proposes a novel method to predict long-term lung cancer survival using gene expression data from TCGA. First, we select the genes most relevant to the target problem with a supervised feature selection method, the mutual information selector. Second, we propose a method to convert gene expression data into two kinds of images that incorporate KEGG BRITE and KEGG Pathway data, so that a convolutional neural network (CNN) model can learn high-level features from them. We then design a CNN-based DL model and add two kinds of clinical data to improve performance, which yields a multimodal DL model. The results of the generalization experiments indicated that our method performed much better than the ML models and the unimodal DL models. Furthermore, we conduct a survival analysis and observe that our model better divides the samples into high-risk and low-risk groups.
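The gene selection step can be illustrated with a minimal sketch. The abstract names only a "mutual information selector" and does not specify the estimator, so the equal-width binning, the plug-in estimate, the function names, and the toy gene names and values below are all assumptions made for illustration, not the authors' implementation.

```python
import math
from collections import Counter


def discretize(values, n_bins=3):
    """Equal-width binning of one gene's expression values (assumed scheme)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant gene carries no information; put every sample in bin 0.
        return [0] * len(values)
    width = (hi - lo) / n_bins
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def mutual_information(xs, ys):
    """Plug-in estimate of mutual information (in nats) for discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), count in pxy.items():
        # p(x, y) * log( p(x, y) / (p(x) * p(y)) ), written with raw counts.
        mi += (count / n) * math.log(count * n / (px[x] * py[y]))
    return mi


def select_top_genes(expression, labels, k, n_bins=3):
    """Rank genes by mutual information with the survival label; keep the top k."""
    scores = {
        gene: mutual_information(discretize(values, n_bins), labels)
        for gene, values in expression.items()
    }
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k], scores


# Hypothetical toy data: three genes, six samples, binary survival label.
toy_expression = {
    "GENE_A": [0.10, 0.20, 0.15, 0.90, 0.95, 0.85],  # tracks the label
    "GENE_B": [0.50, 0.10, 0.90, 0.50, 0.10, 0.90],  # independent of the label
    "GENE_C": [0.30, 0.30, 0.30, 0.30, 0.30, 0.30],  # constant
}
toy_labels = [0, 0, 0, 1, 1, 1]

selected, scores = select_top_genes(toy_expression, toy_labels, k=1)
print(selected)
```

On this toy input only GENE_A separates the two label groups, so it is the gene retained. With real TCGA expression matrices, a library estimator for continuous features would likely be a better choice than fixed-width bins.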