Kundaje Anshul, Middendorf Manuel, Shah Mihir, Wiggins Chris H, Freund Yoav, Leslie Christina
Department of Computer Science, Columbia University, New York, NY 10027, USA.
BMC Bioinformatics. 2006 Mar 20;7 Suppl 1(Suppl 1):S5. doi: 10.1186/1471-2105-7-S1-S5.
We have recently introduced a predictive framework for studying gene transcriptional regulation in simpler organisms using a novel supervised learning algorithm called GeneClass. GeneClass is motivated by the hypothesis that in model organisms such as Saccharomyces cerevisiae, we can learn a decision rule for predicting whether a gene is up- or down-regulated in a particular microarray experiment based on the presence of binding site subsequences ("motifs") in the gene's regulatory region and the expression levels of regulators such as transcription factors in the experiment ("parents"). GeneClass formulates the learning task as a classification problem: predicting +1 and -1 labels corresponding to up- and down-regulation beyond the levels of biological and measurement noise in microarray measurements. Using the AdaBoost algorithm, GeneClass learns a prediction function in the form of an alternating decision tree, a margin-based generalization of a decision tree.
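The classification setup described above can be illustrated with a minimal sketch. It is not the authors' code: the function names, the dictionary layout, the placeholder gene and regulator names, and the noise threshold of 1.0 are all assumptions made for illustration. Each (gene, experiment) pair becomes one example labeled +1 or -1, and pairs whose expression change stays within the assumed noise band are dropped.

```python
def label_expression(log_ratio, noise_threshold=1.0):
    """Return +1 (up), -1 (down), or None when the change is within noise.

    The threshold value is an illustrative assumption, not the paper's noise model.
    """
    if log_ratio > noise_threshold:
        return 1
    if log_ratio < -noise_threshold:
        return -1
    return None


def make_examples(expression, gene_motifs, parent_levels, noise_threshold=1.0):
    """Build one labeled example per (gene, experiment) pair that exceeds noise.

    expression:    {gene: {experiment: log_ratio}} for target genes
    gene_motifs:   {gene: set of motifs found in its regulatory region}
    parent_levels: {experiment: {regulator: log_ratio}} for the "parents"
    """
    examples = []
    for gene, profile in expression.items():
        for experiment, log_ratio in profile.items():
            label = label_expression(log_ratio, noise_threshold)
            if label is None:
                continue  # within noise: not used as a training example
            examples.append({
                "gene": gene,
                "experiment": experiment,
                "label": label,
                "motifs": gene_motifs.get(gene, set()),
                "parents": parent_levels.get(experiment, {}),
            })
    return examples


# Hypothetical toy data (placeholder names, not real measurements).
toy_examples = make_examples(
    expression={"geneA": {"heat": 2.3, "cold": -0.2}},
    gene_motifs={"geneA": {"AGGGG"}},
    parent_levels={"heat": {"regX": 1.8}, "cold": {"regX": 0.1}},
)
```

A learner would then search for rules that combine a motif in `motifs` with the level of a regulator in `parents` to predict `label`.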
In the current work, we introduce a new, robust version of the GeneClass algorithm that increases stability and computational efficiency, yielding a more scalable and reliable predictive model. The improved stability of the prediction tree enables us to introduce a detailed post-processing framework for biological interpretation, including individual and group target gene analysis to reveal condition-specific regulation programs and to suggest signaling pathways. Robust GeneClass uses a novel stabilized variant of boosting that allows a set of correlated features, rather than single features, to be included at nodes of the tree; in this way, biologically important features that are correlated with the single best feature are retained rather than decorrelated and lost in the next round of boosting. Other computational developments include fast matrix computation of the loss function for all features, allowing scalability to large datasets, and the use of abstaining weak rules, which results in a shallower and more interpretable tree. We also show how to incorporate genome-wide protein-DNA binding data from ChIP-chip experiments into the GeneClass algorithm, and we use an improved noise model for gene expression data.
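One way to picture the matrix computation of the loss with abstaining weak rules is the sketch below. It rests on stated assumptions rather than on the released source code: features are taken to be (motif, parent-state) pairs that fire when the motif is present in the gene and the parent state holds in the experiment, rules abstain when they do not fire, and the loss is the standard confidence-rated boosting quantity Z = W_abstain + 2*sqrt(W_correct * W_wrong). The function name and array layout are hypothetical.

```python
import numpy as np


def boosting_loss_all_features(M, P, W, Y):
    """Compute the boosting loss Z for every (motif, parent-state) pair at once.

    M: genes x motifs, 1 where the motif occurs in the gene's regulatory region
    P: experiments x parent_states, 1 where the parent state holds
    W: genes x experiments, non-negative example weights
    Y: genes x experiments, labels in {+1, -1, 0}; 0 marks "not an example"
    Returns Z with shape motifs x parent_states (lower is better).
    """
    M = np.asarray(M, dtype=float)
    P = np.asarray(P, dtype=float)
    W = np.asarray(W, dtype=float)
    Y = np.asarray(Y)

    W_pos = W * (Y == 1)   # weights of up-regulated examples
    W_neg = W * (Y == -1)  # weights of down-regulated examples

    # Two matrix products give, for all pairs simultaneously, the total
    # weight of positive / negative examples on which the rule fires.
    fires_pos = M.T @ W_pos @ P
    fires_neg = M.T @ W_neg @ P

    # Everything else is weight on which the rule abstains.
    abstain = W_pos.sum() + W_neg.sum() - fires_pos - fires_neg
    return abstain + 2.0 * np.sqrt(fires_pos * fires_neg)


# Hypothetical toy problem: 3 genes, 2 motifs, 2 experiments, 2 parent states.
M_toy = [[1, 0], [1, 0], [0, 1]]
P_toy = [[1, 0], [0, 1]]
Y_toy = [[1, -1], [1, -1], [-1, 1]]
W_toy = np.full((3, 2), 1.0 / 6.0)
Z_toy = boosting_loss_all_features(M_toy, P_toy, W_toy, Y_toy)
```

Because an abstaining rule only pays for the weight it leaves unexplained, a rule that fires cleanly on a larger share of the weighted examples gets a smaller Z. In the toy problem, the first motif covers two genes and therefore scores lower than the second.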
Using the improved scalability of Robust GeneClass, we present larger scale experiments on a yeast environmental stress dataset, training and testing on all genes and using a comprehensive set of potential regulators. We demonstrate the improved stability of the features in the learned prediction tree, and we show the utility of the post-processing framework by analyzing two groups of genes in yeast (the protein chaperones and a set of putative targets of the Nrg1 and Nrg2 transcription factors) and suggesting novel hypotheses about their transcriptional and post-transcriptional regulation. Detailed results and Robust GeneClass source code are available for download from http://www.cs.columbia.edu/compbio/robust-geneclass.