Hassanzadeh Hamid Reza, Phan John H, Wang May D
Department of Computational Science and Engineering, Georgia Institute of Technology, Atlanta, Georgia 30332.
Department of Biomedical Engineering, Georgia Institute of Technology and Emory University, Atlanta, Georgia 30332.
Proceedings (IEEE Int Conf Bioinformatics Biomed). 2016 Dec;2016:184-189. doi: 10.1109/bibm.2016.7822516. Epub 2017 Jan 19.
Cancer survival prediction is an active area of research that can help prevent unnecessary therapies and improve patients' quality of life. Gene expression profiling is widely used in cancer studies to discover informative biomarkers that help predict different clinical endpoints. We use multiple modalities of data derived from RNA deep-sequencing (RNA-seq) to predict the survival of cancer patients. Despite the wealth of information available in the expression profiles of tumors, achieving this objective remains a major challenge, largely because data samples are scarce relative to the high dimensionality of the expression profiles. As such, the analysis of transcriptomic data modalities calls for state-of-the-art big-data analytics techniques that can make maximal use of all available data to discover the relevant information hidden within a significant amount of noise. In this paper, we propose a pipeline that predicts cancer patients' survival by exploiting the structure of the input (manifold learning) and by leveraging unlabeled samples using Laplacian support vector machines, a graph-based semi-supervised learning (GSSL) paradigm. We show that, under certain circumstances, no single modality by itself yields the best accuracy, and that fusing different models via a stacked generalization strategy may boost accuracy synergistically. We apply our approach to two cancer datasets and present promising results. We maintain that a similar pipeline can be used for other predictive tasks where labeled samples are expensive to acquire.
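The general shape of such a pipeline (one graph-based semi-supervised model per modality, fused by stacked generalization) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes scikit-learn, which does not ship a Laplacian SVM, so `LabelSpreading` is swapped in as a different graph-based semi-supervised learner. The data are synthetic, the two "modalities" are simply two halves of one feature matrix, and every name (`transduce`, `out_of_fold_scores`, `modality_a`, `modality_b`) is hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold
from sklearn.semi_supervised import LabelSpreading


def transduce(X, y_partial, n_neighbors=7):
    """Fit a graph-based semi-supervised model and return P(class 1) per sample.

    y_partial marks unlabeled samples with -1 (scikit-learn convention).
    LabelSpreading is a stand-in here, NOT a Laplacian SVM.
    """
    model = LabelSpreading(kernel="knn", n_neighbors=n_neighbors, alpha=0.2)
    model.fit(X, y_partial)
    return model.label_distributions_[:, 1]


def out_of_fold_scores(X, y, labeled_idx, n_splits=5, seed=0):
    """Level-0 scores for labeled samples, each computed with its label hidden."""
    scores = np.zeros(len(labeled_idx))
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train, held_out in skf.split(np.zeros(len(labeled_idx)), y[labeled_idx]):
        y_partial = np.full(len(y), -1)
        y_partial[labeled_idx[train]] = y[labeled_idx[train]]
        scores[held_out] = transduce(X, y_partial)[labeled_idx[held_out]]
    return scores


# Synthetic stand-in data: 200 samples, of which only 60 carry labels.
rng = np.random.default_rng(0)
X, y = make_classification(n_samples=200, n_features=40, n_informative=10,
                           random_state=0)
modalities = {"modality_a": X[:, :20], "modality_b": X[:, 20:]}  # hypothetical split
labeled_idx = rng.choice(len(y), size=60, replace=False)
unlabeled_idx = np.setdiff1d(np.arange(len(y)), labeled_idx)

# Level 0: one semi-supervised model per modality, scored out of fold.
Z_train = np.column_stack(
    [out_of_fold_scores(Xm, y, labeled_idx) for Xm in modalities.values()]
)

# Level 1: a meta-learner fuses the per-modality scores (stacked generalization).
meta = LogisticRegression().fit(Z_train, y[labeled_idx])

# Prediction: refit each modality with all available labels, then stack.
y_partial = np.full(len(y), -1)
y_partial[labeled_idx] = y[labeled_idx]
Z_all = np.column_stack([transduce(Xm, y_partial) for Xm in modalities.values()])
pred_unlabeled = meta.predict(Z_all[unlabeled_idx])

# True labels of the "unlabeled" samples are known only because the data are synthetic.
print("accuracy on unlabeled samples:", np.mean(pred_unlabeled == y[unlabeled_idx]))
```

Hiding the held-out labels inside each fold keeps the meta-learner from seeing level-0 scores that were computed with knowledge of the very labels it is trained on, which is the usual safeguard in stacking.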