Liu Kuang-Yu, Zhou Xiaobo, Kan Kinhong, Wong Stephen T C
HCNR -- Center for Bioinformatics, Harvard Medical School, Boston, Massachusetts 02215, USA.
Neuroinformatics. 2006 Winter;4(1):95-117. doi: 10.1385/NI:4:1:95.
Multiple transcription factors (TFs) coordinately control transcriptional regulation of genes in eukaryotes. Although numerous computational methods focus on identifying individual TF-binding sites (TFBSs), very few consider the interdependence among these sites. In this article, we studied the relationship between TFBSs and microarray gene expression levels using both family-wise and member-specific motifs, under various combinations of regression models with Bayesian variable selection, as well as motif scoring and sharing conditions, in order to account for the coordination complexity of transcription regulation. We proposed a three-step approach to model this relationship. In the first step, we preprocessed the microarray data and used p-values and expression ratios to preselect upregulated and downregulated genes. The second step aimed to identify and score individual TFBSs within the DNA sequence of each gene. A method based on the degree of similarity and the number of TFBSs was employed to calculate the score of each TFBS in each gene sequence. In the last step, linear regression and probit regression were used to build a predictive model of gene expression outcomes using these TFBSs as predictors. For a given number of predictors, a full search of all possible predictor sets is usually combinatorially prohibitive. Therefore, this article considered Bayesian variable selection for prediction with either regression model. Bayesian variable selection has previously been applied to gene selection, missing value estimation, and regulatory motif identification. In our modeling, the regressor was approximated as a linear combination of the TFBSs, and a Gibbs sampler was employed to find the strongest TFBSs. We applied these regression models with Bayesian variable selection to a spinal cord injury gene expression data set.
The TFs identified demonstrated intricate regulatory roles, either as a family or as individual members, in neuroinflammatory events. Our analysis can be applied to create plausible hypotheses for combinatorial regulation by TFBSs while avoiding false-positive candidates in the modeling process. Such a systematic approach makes it possible to dissect transcription regulation from a more comprehensive perspective, in which phenotypic events at the cellular and tissue levels are driven by molecular events at the gene transcription and translation levels.
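
The third step, Bayesian variable selection over TFBS predictors with a Gibbs sampler, can be illustrated with a minimal sketch for the linear regression case. The abstract does not state the priors or sampler details, so everything specific here is an assumption made for illustration: a Zellner g-prior on the coefficients with `g` equal to the number of genes, an independent prior inclusion probability `prior_incl` for each TFBS, the function names `log_marginal` and `gibbs_select`, and the synthetic score matrix standing in for real TFBS scores. The probit variant would additionally require sampling a latent variable and is not shown.

```python
import numpy as np

def log_marginal(X, y, gamma, g):
    """Log marginal likelihood (up to a constant) of the model that uses
    only the columns of X flagged in the boolean mask gamma, assuming a
    Zellner g-prior on the coefficients and centered X and y."""
    n = y.size
    q = int(gamma.sum())
    yty = float(y @ y)
    if q == 0:
        return -0.5 * n * np.log(yty)
    Xg = X[:, gamma]
    # Least squares gives the projection of y onto the selected columns.
    coef, *_ = np.linalg.lstsq(Xg, y, rcond=None)
    fit = float(y @ (Xg @ coef))
    shrunk_rss = yty - g / (1.0 + g) * fit
    return -0.5 * q * np.log1p(g) - 0.5 * n * np.log(shrunk_rss)

def gibbs_select(X, y, n_iter=300, burn_in=100, g=None,
                 prior_incl=0.2, seed=0):
    """Gibbs sampler over inclusion indicators; returns the estimated
    posterior inclusion probability of each predictor (column of X)."""
    rng = np.random.default_rng(seed)
    X = X - X.mean(axis=0)   # centering removes the intercept
    y = y - y.mean()
    n, p = X.shape
    g = float(n) if g is None else float(g)
    prior_log_odds = np.log(prior_incl) - np.log1p(-prior_incl)
    gamma = np.zeros(p, dtype=bool)
    counts = np.zeros(p)
    for it in range(n_iter):
        for j in range(p):
            # Compare the model with and without predictor j,
            # holding all other indicators fixed.
            gamma[j] = True
            log_in = log_marginal(X, y, gamma, g) + prior_log_odds
            gamma[j] = False
            log_out = log_marginal(X, y, gamma, g)
            diff = np.clip(log_out - log_in, -700.0, 700.0)
            prob_in = 1.0 / (1.0 + np.exp(diff))
            gamma[j] = rng.random() < prob_in
        if it >= burn_in:
            counts += gamma
    return counts / (n_iter - burn_in)

# Synthetic stand-in for TFBS scores: 100 genes, 8 candidate TFBSs,
# of which only columns 1 and 4 influence the expression value.
rng = np.random.default_rng(1)
scores = rng.normal(size=(100, 8))
expression = (2.0 * scores[:, 1] - 1.5 * scores[:, 4]
              + rng.normal(scale=0.5, size=100))
probs = gibbs_select(scores, expression)
```

Each entry of `probs` estimates how often the corresponding TFBS is included in the model after burn-in, so ranking by `probs` gives one way to read off the "strongest" predictors without enumerating every predictor subset.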