Talebzadeh Mohammad, Zare-Mirakabad Fatemeh
Department of Mathematics and Computer Science, AmirKabir University of Technology, Tehran, Iran.
PLoS One. 2014 Feb 21;9(2):e89226. doi: 10.1371/journal.pone.0089226. eCollection 2014.
Position weight matrices (PWMs) are commonly used in computational prediction of transcription factor binding sites (TFBSs). Although these matrices predict actual binding sites more accurately than simple consensus sequences, they usually produce a large number of false positive (FP) predictions and so are a weak source of information on their own. Several studies have employed additional sources of information, such as sequence conservation or proximity to transcription start sites, to distinguish true binding regions from random ones. Recently, the spatial distribution of modified nucleosomes has been shown to be associated with different promoter architectures. These aligned patterns can facilitate DNA accessibility for transcription factors. We hypothesize that using data from these aligned and periodic patterns can improve the performance of binding region prediction. In this study, we propose two features, "modified nucleosomes neighboring" and "modified nucleosomes occupancy", to reduce FPs in binding site discovery. Based on these features, we designed a logistic regression classifier that estimates the probability that a region is a TFBS. The model was trained on Sp1 binding sites on chromosome 1 and tested on the other chromosomes in human CD4+ T cells. We investigated 21 histone modifications and found that only 8 of the 21 marks are strongly correlated with transcription factor binding regions. To show that these features are not specific to Sp1, we combined the logistic regression classifier with the PWM to create a new model for searching for TFBSs across the genome. We tested this model on the transcription factors MAZ, PU.1 and ELF1 and compared the results with those obtained using the PWM alone. The results show that our model predicts transcription factor binding regions more successfully. The relative simplicity of the model and its capability of integrating other features make it a superior method for TFBS prediction.
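As an illustration of the general idea (not the authors' implementation), the following minimal Python sketch scans a sequence with a PWM and then filters the hits with a logistic function of the two nucleosome features. The toy matrix, feature values, weights, bias, cutoff, and all function names are hypothetical, and the rule for combining the two scores (keep a PWM hit only if its logistic probability passes a cutoff) is an assumption, since the abstract does not specify how the classifier and the PWM are combined.

```python
import math

# Toy 4-bp PWM of log-odds scores for the motif GGCG (illustrative values only).
PWM = [
    {"A": -1.0, "C": -1.0, "G": 1.2, "T": -1.0},
    {"A": -1.0, "C": -1.0, "G": 1.2, "T": -1.0},
    {"A": -1.0, "C": 1.2, "G": -1.0, "T": -1.0},
    {"A": -1.0, "C": -1.0, "G": 1.2, "T": -1.0},
]


def pwm_scan(seq, pwm, threshold):
    """Return (offset, score) for every window scoring at or above threshold."""
    width = len(pwm)
    hits = []
    for i in range(len(seq) - width + 1):
        window = seq[i:i + width]
        score = sum(column[base] for column, base in zip(pwm, window))
        if score >= threshold:
            hits.append((i, score))
    return hits


def binding_probability(neighboring, occupancy, weights, bias):
    """Logistic regression over the two nucleosome features (weights are assumed)."""
    z = bias + weights[0] * neighboring + weights[1] * occupancy
    return 1.0 / (1.0 + math.exp(-z))


def filter_hits(hits, features, weights, bias, cutoff=0.5):
    """Keep PWM hits whose logistic probability reaches the cutoff (assumed rule)."""
    kept = []
    for offset, score in hits:
        neighboring, occupancy = features[offset]
        p = binding_probability(neighboring, occupancy, weights, bias)
        if p >= cutoff:
            kept.append((offset, score, p))
    return kept


# Hypothetical inputs: two PWM matches, only one in a nucleosome context
# that the (made-up) classifier weights favor.
seq = "AAGGCGTTGGCGAA"
hits = pwm_scan(seq, PWM, threshold=4.0)
features = {2: (1.0, 0.9), 8: (0.0, 0.1)}  # offset -> (neighboring, occupancy)
kept = filter_hits(hits, features, weights=(2.0, 3.0), bias=-2.5)
```

In this toy run both occurrences of the motif pass the PWM threshold, but only the one with favorable feature values survives the filter, which is the sense in which chromatin-derived features could reduce PWM false positives.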