College of Agriculture, Nanjing Agricultural University, Nanjing, Jiangsu, 210095, China.
Department of Mathematics, College of Science, Nanjing Agricultural University, Nanjing, 210095, China.
BMC Bioinformatics. 2021 Jan 22;22(1):27. doi: 10.1186/s12859-021-03972-5.
Large-scale gene expression profiling has been successfully applied to discovering functional connections among diseases, genetic perturbations, and drug actions. To reduce the cost of generating ever-larger collections of gene expression profiles, a low-cost, high-throughput reduced-representation expression profiling method called L1000 was proposed, with which one million profiles were produced. L1000 relies on a set of ~1000 carefully chosen landmark genes that capture ~80% of the information in the whole genome, but inferring target genes from these landmark genes is not satisfactorily robust. More efficient computational methods are therefore still needed to mine the influential genes in the genome more deeply.
Here, we propose a deep learning-based computational framework for mining a subset of genes that covers more genomic information. Specifically, an AutoEncoder is first constructed to learn the non-linear relationships between genes, and DeepLIFT is then applied to calculate an importance score for each gene. Using this data-driven approach, we derived a new landmark gene set. The results show that our landmark genes predict target genes more accurately and robustly than those of L1000, as measured by two metrics: mean absolute error (MAE) and Pearson correlation coefficient (PCC). This indicates that the landmark genes detected by our method contain more genomic information.
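As a rough illustration of the two evaluation metrics named above, the following minimal sketch computes MAE and PCC between measured and predicted expression values for a single target gene. The function names and the sample values are hypothetical and are not taken from the study; the authors' actual evaluation pipeline is not described here.

```python
import math


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between measured and predicted values."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def pearson_correlation(y_true, y_pred):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(y_true)
    if n != len(y_pred) or n < 2:
        raise ValueError("inputs must have equal length of at least 2")
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    if var_t == 0 or var_p == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_t * var_p)


# Hypothetical expression values for one target gene across four samples.
measured = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 2.0, 2.5, 4.0]

mae = mean_absolute_error(measured, predicted)   # lower is better
pcc = pearson_correlation(measured, predicted)   # closer to 1 is better
print(f"MAE = {mae:.3f}, PCC = {pcc:.3f}")
```

A lower MAE and a PCC closer to 1 both indicate that the target gene's expression is recovered more faithfully from the landmark genes.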
We believe that the proposed framework is well suited to the analysis of large-scale biological data. Furthermore, the landmark genes inferred in this study can be used to greatly expand collections of gene expression profiles, facilitating research into functional connections.