McGill Centre for Bioinformatics and School of Computer Science, McGill University, H3C 2B4 Québec, Canada.
BMC Bioinformatics. 2012;13(Suppl 19):S2. doi: 10.1186/1471-2105-13-S19-S2. Epub 2012 Dec 19.
The computational prediction of Transcription Factor Binding Sites (TFBS) remains a challenge because the sites are short and have low information content. Comparative genomics approaches, which consider several related species simultaneously and favor sites that have been conserved throughout evolution, improve the accuracy (specificity) of the predictions. They are limited, however, by a phenomenon called binding site turnover, in which sequence evolution causes one TFBS to replace another in the same region. In parallel, an increasing number of mammalian genomes have been sequenced, and it is becoming possible to infer ancestral mammalian sequences to a surprisingly high degree of accuracy.
We propose a TFBS prediction approach that uses inferred ancestral mammalian genomes to improve its accuracy. The method aims to identify binding loci: regions of a few hundred base pairs that have preserved their potential to bind a given transcription factor over evolutionary time. We first propose a neutral evolutionary model of predicted TFBS counts in a DNA region of a given length. We then use this model to identify regions that have preserved the number of predicted TFBS they contain to a degree that is unexpected given their divergence. Applied to human chromosome 1, the approach shows significant gains in accuracy over existing single-species and multi-species TFBS prediction approaches, in particular for transcription factors that are subject to high turnover rates.
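The idea of tracking predicted TFBS counts per region, rather than individual aligned sites, can be illustrated with a minimal sketch. This is a toy example and not the authors' implementation: the position weight matrix `TOY_PWM`, the uniform background, the score threshold, and the two short sequences are all hypothetical, and the neutral evolutionary model itself is not reproduced here. The sketch only shows why a count-based view tolerates turnover: a site can move within the region while the count stays the same.

```python
import math

# Hypothetical toy position weight matrix (PWM) for a 4-bp motif.
# Each entry gives the base probabilities at one motif position.
TOY_PWM = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
]
BACKGROUND = 0.25  # assumed uniform background base frequency


def log_odds(site, pwm=TOY_PWM):
    """Log-odds score (in bits) of a candidate site against the PWM."""
    return sum(math.log2(col[base] / BACKGROUND) for col, base in zip(pwm, site))


def predicted_site_positions(seq, pwm=TOY_PWM, threshold=4.0):
    """Start positions of windows whose score reaches the (assumed) threshold."""
    width = len(pwm)
    positions = []
    for i in range(len(seq) - width + 1):
        window = seq[i:i + width]
        # Skip windows containing gaps or ambiguous bases.
        if set(window) <= set("ACGT") and log_odds(window, pwm) >= threshold:
            positions.append(i)
    return positions


def predicted_site_count(seq, pwm=TOY_PWM, threshold=4.0):
    """Number of predicted sites in a region."""
    return len(predicted_site_positions(seq, pwm, threshold))


# Hypothetical orthologous regions: an inferred ancestor and its human descendant.
ancestor = "TTAGTTAAAGCTCC"
human = "TTAGCTAAGGCTCC"

ancestor_sites = predicted_site_positions(ancestor)
human_sites = predicted_site_positions(human)
print("ancestor:", ancestor_sites, "human:", human_sites)
```

In this toy pair, the predicted site sits at a different position in each sequence, so a site-by-site conservation test would miss it, yet both regions contain exactly one predicted site. A count-based method would compare such counts along the lineage against what a neutral model predicts for regions of the same divergence.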
The source code and predictions made by the program are available at http://www.cs.mcgill.ca/~blanchem/bindingLoci.