Computer Laboratory, University of Cambridge, Cambridge, United Kingdom.
PLoS One. 2012;7(9):e42489. doi: 10.1371/journal.pone.0042489. Epub 2012 Sep 11.
A major goal of bioinformatics is the characterization of transcription factors and the transcriptional programs they regulate. Given the speed of genome sequencing, we would like to quickly annotate regulatory sequences in newly sequenced genomes. In such cases, it would be helpful to predict sequence motifs by using experimental data from a closely related model organism. Here we present a general algorithm that identifies transcription factor binding sites in a newly sequenced species by performing Bayesian regression on the annotated species. First we establish the rationale of our method by applying it within the same species; then we extend it to use data available in closely related species. Finally, we generalise the method to handle the case in which a number of experiments are available from several species close to the species on which inference is to be made. To show the performance of the method, we analyse three functionally related networks in the Ascomycota. Two gene network case studies are related to the G2/M phase of the Ascomycota cell cycle; the third is related to morphogenesis. We also compare the method with MatrixReduce and discuss other types of validation and tests. The first network is well known and provides a biological validation test of the method. The two cell cycle case studies, where the gene network size is conserved, demonstrate the utility of annotating new species sequences using all the available replicates from model species. The third case, where the gene network size varies among species, shows that the combination of information is less powerful but still informative. Our methodology is quite general and could be extended to integrate other high-throughput data from model organisms.
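The idea of carrying information from an annotated species over to a newly sequenced one can be illustrated with a minimal sketch. The abstract does not give the model details, so everything below is an assumption made for illustration: expression is modelled as a linear function of k-mer counts in promoter sequences, the prior and noise are Gaussian with a fixed noise variance, and the posterior fitted on the model species is reused as the prior for the new species. The helper names (`kmer_counts`, `gaussian_posterior`, `simulate`) and the synthetic data are hypothetical and are not taken from the paper.

```python
import numpy as np
from itertools import product


def kmer_counts(seqs, k=3):
    """Count every DNA k-mer in each sequence; returns (count matrix, k-mer list)."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    index = {m: i for i, m in enumerate(kmers)}
    X = np.zeros((len(seqs), len(kmers)))
    for row, seq in enumerate(seqs):
        for i in range(len(seq) - k + 1):
            col = index.get(seq[i:i + k])
            if col is not None:  # skip windows containing non-ACGT symbols
                X[row, col] += 1
    return X, kmers


def gaussian_posterior(X, y, prior_mean, prior_prec, noise_var=1.0):
    """Conjugate update for linear regression with a Gaussian prior on the
    weights (mean, precision matrix) and a known noise variance (assumed)."""
    post_prec = prior_prec + X.T @ X / noise_var
    rhs = prior_prec @ prior_mean + X.T @ y / noise_var
    post_mean = np.linalg.solve(post_prec, rhs)
    return post_mean, post_prec


def simulate(rng, n_genes, motif="ACG", effect=2.0, length=60, noise_sd=0.5):
    """Synthetic stand-in for real data: random promoters whose expression
    rises with the number of occurrences of a planted motif."""
    seqs = ["".join(rng.choice(list("ACGT"), size=length)) for _ in range(n_genes)]
    X, kmers = kmer_counts(seqs, k=len(motif))
    y = effect * X[:, kmers.index(motif)] + rng.normal(0.0, noise_sd, n_genes)
    return X, y, kmers


rng = np.random.default_rng(0)
X_model, y_model, kmers = simulate(rng, n_genes=200)  # well-studied species
X_new, y_new, _ = simulate(rng, n_genes=30)           # newly sequenced species

d = len(kmers)
m0, P0 = np.zeros(d), np.eye(d)                        # weak zero-mean prior
m1, P1 = gaussian_posterior(X_model, y_model, m0, P0)  # fit on model species
m2, P2 = gaussian_posterior(X_new, y_new, m1, P1)      # reuse posterior as prior

top = kmers[int(np.argmax(m2))]
print("highest-weight k-mer:", top)
```

With a shared noise variance, chaining the two updates gives the same posterior as a single fit on the pooled data, which is one simple way to read "using all the available replicates from model species". A real analysis would need to account for differences between species rather than pooling them directly.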