New York University, New York, NY 10003, USA.
Center for Computational Biology, Flatiron Institute, New York, NY 10010, USA.
PLoS Comput Biol. 2019 Jan 24;15(1):e1006591. doi: 10.1371/journal.pcbi.1006591. eCollection 2019 Jan.
Gene regulatory networks are composed of sub-networks that are often shared across biological processes, cell-types, and organisms. Leveraging multiple sources of information, such as publicly available gene expression datasets, could therefore be helpful when learning a network of interest. Integrating data across different studies, however, raises numerous technical concerns. Hence, a common approach in network inference, and broadly in genomics research, is to separately learn models from each dataset and combine the results. Individual models, however, often suffer from under-sampling, poor generalization and limited network recovery. In this study, we explore previous integration strategies, such as batch-correction and model ensembles, and introduce a new multitask learning approach for joint network inference across several datasets. Our method initially estimates the activities of transcription factors, and subsequently, infers the relevant network topology. As regulatory interactions are context-dependent, we estimate model coefficients as a combination of both dataset-specific and conserved components. In addition, adaptive penalties may be used to favor models that include interactions derived from multiple sources of prior knowledge including orthogonal genomics experiments. We evaluate generalization and network recovery using examples from Bacillus subtilis and Saccharomyces cerevisiae, and show that sharing information across models improves network reconstruction. Finally, we demonstrate robustness to both false positives in the prior information and heterogeneity among datasets.
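The two-stage idea described above can be illustrated with a minimal sketch: transcription factor activities are first estimated from a prior network, and each target gene is then regressed on those activities jointly across datasets, with every coefficient written as a conserved part plus a dataset-specific part. Everything here is an assumption made for illustration and not the authors' implementation: the function names, the least-squares activity estimate, the choice of a row-wise group penalty for the conserved part and an L1 penalty for the dataset-specific part, the proximal-gradient solver, and the `penalty_weights` argument standing in for adaptive penalties (smaller weights for interactions supported by prior knowledge).

```python
import numpy as np

def estimate_tf_activities(expression, prior):
    """Least-squares estimate of TF activities (TFs x samples).

    expression: genes x samples matrix; prior: genes x TFs matrix of known
    interactions. Assumes expression is approximately prior @ activities.
    """
    return np.linalg.pinv(prior) @ expression

def soft_threshold(values, threshold):
    """Element-wise shrinkage (proximal operator of the L1 penalty)."""
    return np.sign(values) * np.maximum(np.abs(values) - threshold, 0.0)

def group_soft_threshold(matrix, threshold):
    """Shrink each row (one TF across all datasets) by its L2 norm."""
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    scale = np.maximum(1.0 - threshold / np.maximum(norms, 1e-12), 0.0)
    return matrix * scale

def fit_multitask(designs, responses, lam_shared=0.1, lam_specific=0.1,
                  penalty_weights=None, n_iter=500):
    """Fit one target gene across datasets by proximal gradient descent.

    designs: list of (samples_d x TFs) activity matrices, one per dataset.
    responses: list of (samples_d,) expression vectors for the target gene.
    penalty_weights: hypothetical per-TF weights (length TFs); values below 1
        relax the penalty for prior-supported interactions.
    Returns (shared, specific), each TFs x datasets; the coefficients for
    dataset d are shared[:, d] + specific[:, d].
    """
    n_tfs, n_tasks = designs[0].shape[1], len(designs)
    weights = np.ones(n_tfs) if penalty_weights is None else np.asarray(penalty_weights, float)
    weights = weights[:, None]
    shared = np.zeros((n_tfs, n_tasks))
    specific = np.zeros((n_tfs, n_tasks))
    # The loss depends on shared + specific, so the joint Lipschitz constant
    # is twice the largest per-dataset value.
    lipschitz = max(np.linalg.norm(X, 2) ** 2 / X.shape[0] for X in designs)
    step = 1.0 / (2.0 * lipschitz)
    for _ in range(n_iter):
        grad = np.zeros((n_tfs, n_tasks))
        for d, (X, y) in enumerate(zip(designs, responses)):
            residual = X @ (shared[:, d] + specific[:, d]) - y
            grad[:, d] = X.T @ residual / X.shape[0]
        # Both components see the same gradient; only the penalties differ.
        shared = group_soft_threshold(shared - step * grad, step * lam_shared * weights)
        specific = soft_threshold(specific - step * grad, step * lam_specific * weights)
    return shared, specific
```

In this sketch an interaction present in every dataset tends to be absorbed by the conserved component, because one grouped row is cheaper than several separate L1 terms, while an interaction seen in a single dataset can still enter through the dataset-specific component. A network would be assembled by repeating the fit for each target gene and ranking the resulting coefficients.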