Holloway Dustin T, Kon Mark, DeLisi Charles
Molecular Biology, Cell Biology and Biochemistry, Boston University, Boston, MA 02215, USA.
Genome Inform. 2005;16(1):83-94.
Transcription factor binding sites (TFBS) in gene promoter regions are often predicted by using position specific scoring matrices (PSSMs), which summarize sequence patterns of experimentally determined TF binding sites. Although PSSMs are more reliable than simple consensus string matching in predicting a true binding site, they generally result in high numbers of false positive hits. This study attempts to reduce the number of false positive matches and generate new predictions by integrating various types of genomic data using two methods: a Bayesian allocation procedure and support vector machine classification. Several methods are explored to strengthen the prediction of a true TFBS in the Saccharomyces cerevisiae genome: binding site degeneracy, binding site conservation, phylogenetic profiling, TF binding site clustering, gene expression profiles, GO functional annotation, and k-mer counts in promoter regions. Binding site degeneracy (or redundancy) refers to the number of times a particular transcription factor's binding motif is discovered in the upstream region of a gene. Binding site (phylogenetic) conservation takes into account the number of orthologous upstream regions in other genomes that contain a particular binding site. Phylogenetic profiling refers to the presence or absence of a gene across a large set of genomes. Binding site clusters are statistically significant clusters of TF binding sites detected by the algorithm ClusterBuster. Gene expression data rest on the idea that when the expression profiles of a transcription factor and a potential target gene are correlated, the gene is more likely to be a genuine target. Also, genes with highly correlated expression profiles are often regulated by the same TF(s). The GO annotation data take advantage of the idea that common transcription targets often have related functions.
Finally, the distribution of the counts of all k-mers of length 4, 5, and 6 in a gene's promoter region was examined as a means to predict TF binding. In each case the data are compared to known true positives taken from ChIP-chip data, Transfac, and the Saccharomyces Genome Database. First, degeneracy, conservation, expression, and binding site clusters were examined independently and in combination via Bayesian allocation. Then, binding sites were predicted with a support vector machine (SVM) using each method alone and all methods in combination. The SVM works best when all genomic data are combined, but it can also identify which methods contribute the most to accurate classification. On average, a support vector machine can classify binding sites with high sensitivity and an accuracy of almost 80%.
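As an illustration of the k-mer count features mentioned above, the following is a minimal sketch of how a promoter sequence might be turned into a fixed-length vector of counts for all k-mers of length 4, 5, and 6. The function name `kmer_count_vector`, the use of overlapping windows, and the choice to ignore windows containing non-ACGT characters are assumptions made for illustration; the abstract does not specify how the counts were computed or normalized.

```python
from collections import Counter
from itertools import product

ALPHABET = "ACGT"


def kmer_count_vector(promoter, ks=(4, 5, 6)):
    """Count every possible k-mer (for each k in ks) in a promoter sequence.

    Hypothetical helper: overlapping windows are counted, and windows that
    contain characters outside ACGT are ignored (assumptions, not taken
    from the paper).
    """
    seq = promoter.upper()
    features = []
    for k in ks:
        # Count each overlapping window of length k.
        counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
        # Enumerate all 4**k k-mers in a fixed order so that every
        # promoter yields a vector of the same length.
        features.extend(counts["".join(p)] for p in product(ALPHABET, repeat=k))
    return features


vec = kmer_count_vector("ACGTACGTAC")
print(len(vec))  # 4**4 + 4**5 + 4**6 = 5376 features
```

A vector of this kind, one per gene, could then be passed to a classifier such as an SVM alongside the other data types; how the features were scaled or combined in the study is not stated in the abstract.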