Biomedical Data Science and Informatics Program, Clemson University, Clemson, SC, United States of America.
Genetics and Biochemistry Department, Clemson University, Clemson, SC, United States of America.
PLoS One. 2019 Aug 6;14(8):e0220279. doi: 10.1371/journal.pone.0220279. eCollection 2019.
Gene co-expression networks (GCNs) are constructed from Gene Expression Matrices (GEMs) in a bottom-up approach in which every gene pair is tested for correlation within the context of the input sample set. This approach is computationally intensive for many current GEMs and may not scale to millions of samples. Further, traditional GCNs do not detect non-linear relationships that correlation tests miss, and they do not place genetic relationships in a gene expression intensity context. In this report, we propose EdgeScaping, which constructs and analyzes the pairwise gene intensity network in a holistic, top-down approach in which no edges are filtered. EdgeScaping uses a novel technique to convert traditional pairwise gene expression data to an image-based format. This conversion not only performs feature compression, making our algorithm highly scalable, but also allows non-linear relationships between genes to be explored with deep learning image analysis algorithms. Using the learned embedded feature space, we implement a fast, efficient algorithm to cluster the entire space of gene expression relationships while retaining gene expression intensity. Because EdgeScaping does not eliminate conventionally noisy edges, it extends the identification of co-expression relationships beyond classically correlated edges and facilitates the discovery of novel or unusual expression patterns within the network. We applied EdgeScaping to a human tumor GEM to identify sets of genes that exhibit conventional and non-conventional interdependent non-linear behavior associated with brain-specific tumor sub-types, and that would be eliminated in conventional bottom-up construction of GCNs. EdgeScaping source code is available at https://github.com/bhusain/EdgeScaping under the MIT license.
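The abstract does not specify how a gene pair is encoded as an image, so the following is only a minimal sketch of one plausible encoding, not the EdgeScaping implementation. It assumes each pair is rasterized into a fixed-size 2D histogram of sample counts over a shared expression range. The function name `pair_to_image` and the parameters `bins`, `lo`, and `hi` are hypothetical and chosen for illustration.

```python
from typing import List, Sequence


def pair_to_image(
    x: Sequence[float],
    y: Sequence[float],
    bins: int = 8,
    lo: float = 0.0,
    hi: float = 16.0,
) -> List[List[int]]:
    """Rasterize one gene pair into a bins x bins grid of sample counts.

    Assumption: a 2D histogram over a fixed, shared range [lo, hi].
    The source does not state the actual image encoding.
    """
    if len(x) != len(y):
        raise ValueError("both genes must be measured on the same samples")
    width = (hi - lo) / bins
    image = [[0] * bins for _ in range(bins)]
    for xv, yv in zip(x, y):
        # Clamp out-of-range values into the edge bins.
        col = min(bins - 1, max(0, int((xv - lo) / width)))
        row = min(bins - 1, max(0, int((yv - lo) / width)))
        image[row][col] += 1
    return image


# Toy expression values for two hypothetical genes across five samples.
gene_a = [1.0, 2.0, 3.0, 14.0, 15.0]
gene_b = [1.5, 2.5, 3.5, 1.0, 2.0]
img = pair_to_image(gene_a, gene_b, bins=4, lo=0.0, hi=16.0)
for row in img:
    print(row)
```

Under this assumption, the image size depends only on `bins` and not on the number of samples, which is one way the conversion could act as feature compression. Because every pair shares the same `[lo, hi]` range, a pixel's position reflects absolute expression intensity rather than a per-pair rescaling.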