Institute for Molecular Bioscience, The University of Queensland, Brisbane, Queensland 4072, Australia.
Bioinformatics. 2011 Sep 1;27(17):2354-60. doi: 10.1093/bioinformatics/btr399. Epub 2011 Jun 30.
Direct binding by a transcription factor (TF) to the proximal promoter of a gene is strong evidence that the TF regulates the gene. Assaying the genome-wide binding of every TF in every cell type and condition is currently impractical. Histone modifications correlate with tissue/cell/condition-specific ('tissue-specific') TF binding, so histone ChIP-seq data can be combined with traditional position weight matrix (PWM) methods to make tissue-specific predictions of TF-promoter interactions.
We use supervised learning to train a naïve Bayes predictor of TF-promoter binding. The predictor's features are the histone modification levels and a PWM-based score for the promoter. Training and testing use sets of promoters labeled using TF ChIP-seq data, and we use cross-validation on 23 such datasets to measure accuracy. A PWM+histone naïve Bayes predictor using a single histone modification (H3K4me3) is substantially more accurate than a PWM score or a conservation-based score (phylogenetic motif model). On average, the naïve Bayes predictor is more accurate at all sensitivity levels, and it makes only half as many false positive predictions at sensitivity levels from 10% to 80%. On average, it correctly predicts 80% of bound promoters at a false positive rate of 20%. Accuracy does not diminish when we test the predictor in a different cell type (and species) from the one used for training. Accuracy is barely diminished even when we train the predictor without using TF ChIP-seq data.
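As a rough illustration of the idea described above, the following sketch trains a two-feature naïve Bayes classifier on promoters described by a PWM-based score and an H3K4me3 level. It is not the Dr Gene implementation: the Gaussian class-conditional densities, the function names `fit_gaussian_nb` and `log_posterior_odds`, and the toy feature values are all assumptions made for this example.

```python
import math
from statistics import mean, pvariance


def fit_gaussian_nb(features, labels):
    """Estimate a prior plus per-feature mean and variance for each class.

    Assumption: each feature is modeled as an independent Gaussian per class.
    The abstract does not say how the features are modeled.
    """
    model = {}
    for cls in set(labels):
        rows = [f for f, y in zip(features, labels) if y == cls]
        columns = list(zip(*rows))
        model[cls] = {
            "prior": len(rows) / len(labels),
            "means": [mean(c) for c in columns],
            # A small constant keeps every variance strictly positive.
            "vars": [pvariance(c) + 1e-6 for c in columns],
        }
    return model


def log_posterior_odds(model, x):
    """Return log P(bound | x) - log P(unbound | x) for one promoter."""

    def log_joint(cls):
        params = model[cls]
        total = math.log(params["prior"])
        for value, mu, var in zip(x, params["means"], params["vars"]):
            total += -0.5 * math.log(2 * math.pi * var)
            total -= (value - mu) ** 2 / (2 * var)
        return total

    return log_joint(1) - log_joint(0)


# Hypothetical toy data: (PWM-based score, H3K4me3 level) per promoter.
# Label 1 = bound according to TF ChIP-seq, label 0 = not bound.
features = [
    (8.0, 7.5), (9.0, 8.0), (7.5, 6.5), (8.5, 9.0),
    (3.0, 1.0), (4.0, 2.0), (2.5, 0.5), (5.0, 1.5),
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

model = fit_gaussian_nb(features, labels)
print(log_posterior_odds(model, (8.2, 7.0)) > 0)  # high score, high H3K4me3
print(log_posterior_odds(model, (3.5, 1.2)) > 0)  # low score, low H3K4me3
```

Ranking promoters by the log posterior odds and sweeping a threshold over that ranking is one way to trade sensitivity against false positive rate, which is how figures such as "80% of bound promoters at a false positive rate of 20%" are typically read off.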
Our tissue-specific predictor of promoters bound by a TF is called Dr Gene and is available at http://bioinformatics.org.au/drgene.
Supplementary data are available at Bioinformatics online.