Yifei Chen, Yi Li, Rajiv Narayan, Aravind Subramanian, Xiaohui Xie
Department of Computer Science, University of California, Irvine, CA 92697, USA; Baidu Research-Big Data Lab, Beijing, 100085, China.
Department of Computer Science, University of California, Irvine, CA 92697, USA.
Bioinformatics. 2016 Jun 15;32(12):1832-9. doi: 10.1093/bioinformatics/btw074. Epub 2016 Feb 11.
Large-scale gene expression profiling has been widely used to characterize cellular states in response to various disease conditions, genetic perturbations, etc. Although the cost of whole-genome expression profiling has been dropping steadily, generating a compendium of expression profiles over thousands of samples is still very expensive. Recognizing that gene expression levels are often highly correlated, researchers from the NIH LINCS program have developed a cost-effective strategy of profiling only ∼1000 carefully selected landmark genes and relying on computational methods to infer the expression of the remaining target genes. However, the computational approach currently adopted by the LINCS program is based on linear regression (LR), limiting its accuracy since it does not capture the complex nonlinear relationships between gene expression levels.
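For illustration, the sketch below shows the kind of multi-output linear regression baseline described above: a single least-squares model mapping landmark-gene expression to all target genes at once. The matrix shapes, random data and variable names are hypothetical and only illustrate the idea, not the actual LINCS pipeline.

```python
# Minimal sketch of a linear-regression baseline that infers target-gene
# expression from landmark-gene expression. Sizes and data are toy values.
import numpy as np
from sklearn.linear_model import LinearRegression

n_samples, n_landmark, n_target = 5000, 1000, 9000   # illustrative only
rng = np.random.default_rng(0)
X_train = rng.normal(size=(n_samples, n_landmark))   # landmark-gene profiles
Y_train = rng.normal(size=(n_samples, n_target))     # target-gene profiles

lr = LinearRegression()       # ordinary least squares, one output per target gene
lr.fit(X_train, Y_train)

X_new = rng.normal(size=(10, n_landmark))
Y_pred = lr.predict(X_new)    # inferred expression of all target genes
```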
We present a deep learning method (abbreviated as D-GEX) to infer the expression of target genes from the expression of landmark genes. We used the microarray-based Gene Expression Omnibus dataset, consisting of 111K expression profiles, to train our model and compare its performance with that of other methods. In terms of mean absolute error averaged across all genes, deep learning significantly outperforms LR, with a 15.33% relative improvement. A gene-wise comparative analysis shows that deep learning achieves lower error than LR in 99.97% of the target genes. We also tested the performance of the learned model on an independent RNA-Seq-based GTEx dataset consisting of 2921 expression profiles. Deep learning still outperforms LR, with a 6.57% relative improvement, and achieves lower error in 81.31% of the target genes.
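The sketch below illustrates, under stated assumptions, the general approach: a multi-task feed-forward network that predicts all target genes jointly from the landmark genes, together with the gene-wise mean-absolute-error comparison mentioned above. The layer sizes, optimizer settings and PyTorch implementation are illustrative and are not the exact D-GEX configuration.

```python
# Minimal sketch, assuming a small fully connected multi-task network
# (landmark genes in, all target genes out) and a gene-wise MAE metric.
import torch
import torch.nn as nn

n_landmark, n_target = 1000, 9000

model = nn.Sequential(               # all target genes predicted jointly
    nn.Linear(n_landmark, 3000),
    nn.Tanh(),
    nn.Dropout(0.1),
    nn.Linear(3000, 3000),
    nn.Tanh(),
    nn.Dropout(0.1),
    nn.Linear(3000, n_target),
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.MSELoss()

X = torch.randn(64, n_landmark)      # one toy mini-batch of landmark profiles
Y = torch.randn(64, n_target)        # corresponding target profiles

optimizer.zero_grad()
loss = loss_fn(model(X), Y)          # one training step on the mini-batch
loss.backward()
optimizer.step()

# Gene-wise mean absolute error: average |error| over samples for each
# target gene; averaging again over genes gives the overall MAE used to
# compare methods.
model.eval()
with torch.no_grad():
    mae_per_gene = (model(X) - Y).abs().mean(dim=0)   # shape: (n_target,)
    overall_mae = mae_per_gene.mean()
```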
D-GEX is available at https://github.com/uci-cbcl/D-GEX. Contact: xhx@ics.uci.edu
Supplementary data are available at Bioinformatics online.