Kundaje Anshul, Lianoglou Steve, Li Xuejing, Quigley David, Arias Marta, Wiggins Chris H, Zhang Li, Leslie Christina
Department of Computer Science, Center for Computational Learning Systems, Columbia University, New York, NY 10065, USA.
Ann N Y Acad Sci. 2007 Dec;1115:178-202. doi: 10.1196/annals.1407.020. Epub 2007 Oct 12.
Inferring gene regulatory networks from high-throughput genomic data is one of the central problems in computational biology. In this paper, we describe a predictive modeling approach for studying regulatory networks, based on a machine learning algorithm called MEDUSA. MEDUSA integrates promoter sequence, mRNA expression, and transcription factor occupancy data to learn gene regulatory programs that predict the differential expression of target genes. Instead of using clustering or correlation of expression profiles to infer regulatory relationships, MEDUSA determines condition-specific regulators and discovers regulatory motifs that mediate the regulation of target genes. In this way, MEDUSA meaningfully models biological mechanisms of transcriptional regulation. MEDUSA solves the problem of predicting the differential (up/down) expression of target genes by using boosting, a technique from statistical learning, which helps to avoid overfitting as the algorithm searches through the high-dimensional space of potential regulators and sequence motifs. Experimental results demonstrate that MEDUSA achieves high prediction accuracy on held-out experiments (test data), that is, data not seen in training. We also present context-specific analysis of MEDUSA regulatory programs for DNA damage and hypoxia, demonstrating that MEDUSA identifies key regulators and motifs in these processes. A central challenge in the field is the difficulty of validating reverse-engineered networks in the absence of a gold standard. Our approach of learning regulatory programs provides at least a partial solution for the problem: MEDUSA's prediction accuracy on held-out data gives a concrete and statistically sound way to validate how well the algorithm performs. With MEDUSA, statistical validation becomes a prerequisite for hypothesis generation and network building rather than a secondary consideration.
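The boosting formulation described above can be illustrated with a minimal sketch. It is not the MEDUSA implementation. It assumes plain discrete AdaBoost over hypothetical weak rules of the form "motif present in the promoter AND regulator in a given state", and it uses a made-up toy dataset in which one experiment is held out for testing. All function names, motif names (M1, M2), and regulator names (R1, R2) are invented for illustration.

```python
import math
from itertools import product


def rule_output(rule, gene_motifs, reg_states):
    """Weak rule: +1 if the motif is in the promoter AND the regulator is in
    the given state (+1 up, -1 down) in this experiment; otherwise -1."""
    motif, regulator, state = rule
    hit = motif in gene_motifs and reg_states.get(regulator, 0) == state
    return 1 if hit else -1


def train_boosted_rules(examples, motifs, regulators, rounds=5):
    """Discrete AdaBoost over (motif, regulator, state) rules.
    examples: list of (gene_motifs, reg_states, label) with label in {+1, -1}."""
    n = len(examples)
    weights = [1.0 / n] * n
    candidates = list(product(motifs, regulators, (1, -1)))
    model = []
    for _ in range(rounds):
        # Pick the rule whose weighted error is farthest from chance (0.5);
        # a rule with error > 0.5 receives a negative vote, which inverts it.
        best_rule, best_err = None, 0.5
        for rule in candidates:
            err = sum(w for w, (gm, rs, y) in zip(weights, examples)
                      if rule_output(rule, gm, rs) != y)
            if best_rule is None or abs(0.5 - err) > abs(0.5 - best_err):
                best_rule, best_err = rule, err
        err = min(max(best_err, 1e-10), 1 - 1e-10)  # avoid log(0)
        alpha = 0.5 * math.log((1 - err) / err)
        model.append((alpha, best_rule))
        # Reweight: misclassified examples gain weight, then renormalize.
        weights = [w * math.exp(-alpha * y * rule_output(best_rule, gm, rs))
                   for w, (gm, rs, y) in zip(weights, examples)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return model


def predict(model, gene_motifs, reg_states):
    """Sign of the weighted vote: +1 = up-regulated, -1 = down-regulated."""
    score = sum(a * rule_output(r, gene_motifs, reg_states) for a, r in model)
    return 1 if score >= 0 else -1


# Toy data (invented): a gene is up only if it carries M1 and R1 is up.
genes = {"g1": {"M1"}, "g2": {"M2"}, "g3": {"M1", "M2"}, "g4": set()}
experiments = {
    "e1": {"R1": 1, "R2": -1},
    "e2": {"R1": -1, "R2": 1},
    "e3": {"R1": 1, "R2": 1},   # held out
    "e4": {"R1": -1, "R2": -1},
}


def label(gene_motifs, reg_states):
    return 1 if "M1" in gene_motifs and reg_states["R1"] == 1 else -1


def make_examples(exp_names):
    return [(gm, experiments[e], label(gm, experiments[e]))
            for e in exp_names for gm in genes.values()]


train = make_examples(["e1", "e2", "e4"])
held_out = make_examples(["e3"])

model = train_boosted_rules(train, ["M1", "M2"], ["R1", "R2"])
correct = sum(predict(model, gm, rs) == y for gm, rs, y in held_out)
accuracy = correct / len(held_out)
print("held-out accuracy:", accuracy)
```

Scoring the model on an experiment that was excluded from training mirrors the validation argument in the abstract: accuracy on held-out conditions is measured before any learned rule is read as a regulatory hypothesis.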