Torkey Hanaa, Atlam Mostafa, El-Fishawy Nawal, Salem Hanaa
Computer Science & Engineering Department, Faculty of Electronic Engineering, Menoufia University, Menouf, Egypt.
Faculty of Engineering, Delta University for Science and Technology, Gamasa, Egypt.
PeerJ Comput Sci. 2021 Apr 21;7:e492. doi: 10.7717/peerj-cs.492. eCollection 2021.
Breast cancer is one of the major causes of mortality worldwide, and a range of Machine Learning (ML) techniques have therefore been applied to its diagnosis and to survival estimation. Survival analysis methods compute the survival probability and identify the factors that most affect it. Most of these methods are designed for clinical features (up to hundreds), so applying methods such as Cox regression to RNA-seq microarray data with many features (up to thousands) is a major challenge.
In this paper, we propose a novel approach that applies an autoencoder to reduce the number of features. The approach reconstructs the features and removes both noise in the data and features with zero variance across the samples, which facilitates extraction of the highest-variance features (across the samples) that most influence the survival probabilities. It then estimates the survival probability for each patient by applying random survival forests and Cox regression. Because applying the autoencoder to thousands of features is time-consuming, the model is run on a Graphics Processing Unit (GPU) to speed up the process. The model is evaluated and compared with existing models on three different datasets in terms of run time, concordance index, and calibration curve, and the genes most related to survival are identified. Finally, the biological pathways and GO molecular functions of these significant genes are analyzed.
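The feature-reduction step described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names (`drop_zero_variance`, `train_autoencoder`, `encode`), the single tanh hidden layer, the learning rate, and the plain gradient-descent training loop are all assumptions chosen for brevity, and the sketch runs on the CPU rather than a GPU.

```python
import numpy as np

def drop_zero_variance(X):
    """Remove features (columns) whose variance across samples is zero."""
    keep = X.var(axis=0) > 0
    return X[:, keep], keep

def train_autoencoder(X, n_hidden=2, lr=0.1, epochs=300, seed=0):
    """Fit a one-hidden-layer autoencoder (tanh encoder, linear decoder)
    by gradient descent on the mean squared reconstruction error."""
    rng = np.random.default_rng(seed)
    n_features = X.shape[1]
    W1 = rng.normal(0.0, 0.1, (n_features, n_hidden))
    b1 = np.zeros(n_hidden)
    W2 = rng.normal(0.0, 0.1, (n_hidden, n_features))
    b2 = np.zeros(n_features)
    losses = []
    for _ in range(epochs):
        H = np.tanh(X @ W1 + b1)        # encoded (reduced) features
        err = (H @ W2 + b2) - X         # reconstruction error
        losses.append(float((err ** 2).mean()))
        # Backpropagate the mean squared error through both layers.
        g_out = 2.0 * err / err.size
        g_hid = (g_out @ W2.T) * (1.0 - H ** 2)
        W2 -= lr * (H.T @ g_out)
        b2 -= lr * g_out.sum(axis=0)
        W1 -= lr * (X.T @ g_hid)
        b1 -= lr * g_hid.sum(axis=0)
    return (W1, b1, W2, b2), losses

def encode(X, params):
    """Map samples to the low-dimensional code learned by the autoencoder."""
    W1, b1, _, _ = params
    return np.tanh(X @ W1 + b1)
```

In a pipeline of this kind, the output of `encode` would replace the original expression matrix as input to a survival model such as Cox regression or a random survival forest; standardizing the features before training is assumed here.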
We fine-tuned our autoencoder model on the RNA-seq data of three datasets to train the weights in our survival prediction model, and then used different samples from each dataset to test the model. The results show that the proposed AutoCox and AutoRandom algorithms, which are based on our autoencoder feature selection approach, achieve a better concordance index than the most recent deep learning approaches on each dataset. A weight is computed for each gene from our autoencoder model; these weights indicate the degree to which each gene affects the survival probability. For instance, four experimentally validated survival-related genes, PTPRG, MYST1, BG683264, and AK094562, are at the top of our gene weight list for the breast cancer gene expression dataset. Our approach improves survival analysis by speeding up the process, enhancing the prediction accuracy, and reducing the error rate in the estimated survival probability.
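For readers unfamiliar with the concordance index used in this comparison, the following is a small sketch of Harrell's C for right-censored data. The function name and argument layout are assumptions made for illustration, pairs with tied survival times are skipped for simplicity, and a higher risk score is assumed to mean a shorter expected survival.

```python
def concordance_index(times, events, risk):
    """Harrell's C: fraction of comparable patient pairs whose predicted
    risk ordering agrees with the observed survival ordering.

    times  -- observed follow-up time per patient
    events -- 1 if the event (death) was observed, 0 if censored
    risk   -- predicted risk score; higher means shorter expected survival
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored patient cannot anchor a comparable pair
        for j in range(n):
            # Comparable only if patient i had the event strictly before
            # patient j's observed time.
            if times[i] < times[j]:
                comparable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0
                elif risk[i] == risk[j]:
                    concordant += 0.5  # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable
```

A value of 1.0 means every comparable pair is ordered correctly, 0.5 corresponds to uninformative predictions, and 0.0 means every pair is ordered the wrong way round.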