Department of Electrical and Computer Engineering, University of Pittsburgh, Pittsburgh, Pennsylvania, USA.
Department of Pediatrics, UPMC Children's Hospital of Pittsburgh, University of Pittsburgh, Pittsburgh, Pennsylvania, USA.
J Comput Biol. 2021 May;28(5):485-500. doi: 10.1089/cmb.2020.0438.
Gene expression profiling enables biological studies across many fields because it thoroughly characterizes cellular states under various experimental conditions. Despite recent advances in high-throughput technology, profiling an entire set of genomes is still difficult and expensive. Because the expression patterns of different genes are highly correlated, this problem can be addressed with a cost-effective approach that measures only a small subset of genes, called landmark genes, that represent the entire set of genes, and infers the remaining genes, called target genes, using a computational model. There are several shallow and deep regression models in the literature for estimating the expression of target genes from the landmark genes. However, the shallow models mostly have limited capacity to learn nonlinear and complex gene expression data and are prone to underfitting, while the deep models generally do not take advantage of the correlation among target genes during learning and suffer from overfitting. Treating gene expression inference as a multitask learning problem, we propose a new deep multitask learning algorithm to tackle these issues. Our learning framework automatically learns the correlation between target genes and uses this knowledge to improve its generalization. Specifically, we use a subnetwork with low-dimensional latent variables to discover the relationships between target genes and impose a seamless, easy-to-implement regularization on our deep regression model. Unlike existing multitask learning methods, which can deal with only dozens or hundreds of tasks, our algorithm efficiently learns the relationships between ∼10,000 target genes and is thus scalable to a large number of tasks.
Our proposed method outperforms both the shallow and deep regression models for gene expression inference and the alternative multitask learning algorithms on two large-scale datasets, regardless of the network architecture.
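To make the problem setup concrete, the sketch below frames landmark-to-target inference as a multi-output regression, using a shallow ridge regression of the kind the abstract contrasts with deep models. It is not the proposed deep multitask algorithm. The synthetic data generator, the function names (`make_synthetic`, `fit_ridge`, `predict`), the problem sizes, the regularization strength `alpha`, and the choice of mean absolute error as the metric are all illustrative assumptions, not details taken from the paper.

```python
import numpy as np


def make_synthetic(n_samples=400, n_landmark=20, n_target=100, n_latent=5, seed=0):
    """Hypothetical data: landmark and target genes share a few latent factors.

    This mimics, in a very simplified way, the high correlation between
    expression patterns of different genes. It is not real expression data.
    """
    rng = np.random.default_rng(seed)
    Z = rng.normal(size=(n_samples, n_latent))   # shared latent factors
    A = rng.normal(size=(n_latent, n_landmark))  # factor loadings, landmark genes
    B = rng.normal(size=(n_latent, n_target))    # factor loadings, target genes
    X = Z @ A + 0.1 * rng.normal(size=(n_samples, n_landmark))
    Y = Z @ B + 0.1 * rng.normal(size=(n_samples, n_target))
    return X, Y


def fit_ridge(X, Y, alpha=1.0):
    """Closed-form multi-output ridge regression; returns weights W and bias b."""
    x_mean = X.mean(axis=0)
    y_mean = Y.mean(axis=0)
    Xc = X - x_mean
    Yc = Y - y_mean
    # Solve (Xc^T Xc + alpha * I) W = Xc^T Yc for all target genes at once.
    gram = Xc.T @ Xc + alpha * np.eye(X.shape[1])
    W = np.linalg.solve(gram, Xc.T @ Yc)
    b = y_mean - x_mean @ W
    return W, b


def predict(X, W, b):
    """Infer target-gene expression from landmark-gene expression."""
    return X @ W + b


X, Y = make_synthetic()
X_train, X_test = X[:300], X[300:]
Y_train, Y_test = Y[:300], Y[300:]

W, b = fit_ridge(X_train, Y_train)
Y_pred = predict(X_test, W, b)

# Compare against a trivial predictor that always outputs the training mean.
mae = np.mean(np.abs(Y_pred - Y_test))
baseline_mae = np.mean(np.abs(Y_train.mean(axis=0) - Y_test))
print(f"ridge MAE: {mae:.3f}, mean-predictor MAE: {baseline_mae:.3f}")
```

In this baseline each target gene is fitted independently given the landmark genes; the abstract's argument is that a model which also learns the relationships among target genes can generalize better than one that ignores them.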