Casimiro Ana C, Vinga Susana, Freitas Ana T, Oliveira Arlindo L
INESC-ID/IST, Rua Alves Redol, 9 1000-029 Lisboa, Portugal.
BMC Bioinformatics. 2008 Feb 7;9:89. doi: 10.1186/1471-2105-9-89.
Motif-finding algorithms have improved in their ability to use computationally efficient methods to detect patterns in biological sequences. However, the subsequent classification of their output still suffers from limitations that make it difficult to assess the biological significance of the motifs found. Previous work has highlighted the positional bias of motifs in DNA sequences, which might not only indicate that a pattern is important but also hint at the positions where it preferentially occurs.
We propose to integrate position uniformity tests and over-representation tests to improve the accuracy of motif classification. Using artificial data, we compared three statistical tests (Chi-Square, Kolmogorov-Smirnov and a Chi-Square bootstrap) to assess whether a given motif occurs uniformly in the promoter region of a gene. Using the test that performed best on this dataset, we then studied the positional distribution of several well-known cis-regulatory elements in the promoter sequences of different organisms (S. cerevisiae, H. sapiens, D. melanogaster, E. coli and several dicotyledonous plants). The results show that position conservation is relevant for the transcriptional machinery.
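As a rough illustration of what a position uniformity test involves, the sketch below computes a binned Chi-Square statistic and a Kolmogorov-Smirnov distance for a list of motif start positions, and estimates a p-value by simulating uniformly placed positions. It is not the authors' implementation: the function names, the number of bins, the number of resamples and the simulation scheme are all assumptions made for this example, and the Monte Carlo step is only one plausible reading of a "Chi-Square bootstrap".

```python
import random


def chi_square_statistic(positions, region_length, n_bins):
    """Pearson chi-square statistic of binned positions against a uniform expectation."""
    counts = [0] * n_bins
    for p in positions:
        # Map an integer position in [0, region_length) to a bin index.
        counts[min(p * n_bins // region_length, n_bins - 1)] += 1
    expected = len(positions) / n_bins
    return sum((c - expected) ** 2 / expected for c in counts)


def ks_statistic(positions, region_length):
    """Largest gap between the empirical CDF and the uniform CDF on [0, region_length].

    Positions are discrete, so comparing against a continuous CDF is an approximation.
    """
    xs = sorted(positions)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        cdf = x / region_length
        d = max(d, (i + 1) / n - cdf, cdf - i / n)
    return d


def monte_carlo_p_value(positions, region_length, n_bins=10, n_resamples=2000, seed=0):
    """Estimate P(chi-square >= observed) when positions are placed uniformly at random.

    Assumption for this sketch: the null model draws the same number of
    positions independently and uniformly over the promoter region.
    """
    rng = random.Random(seed)
    observed = chi_square_statistic(positions, region_length, n_bins)
    n = len(positions)
    hits = 0
    for _ in range(n_resamples):
        simulated = [rng.randrange(region_length) for _ in range(n)]
        if chi_square_statistic(simulated, region_length, n_bins) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_resamples + 1)


# Hypothetical data: a motif clustered about 100 bp into a 1000 bp promoter region,
# and a motif spread evenly along the same region.
clustered = [100 + i % 20 for i in range(60)]
spread = [i * 10 for i in range(100)]

p_clustered = monte_carlo_p_value(clustered, 1000)
p_spread = monte_carlo_p_value(spread, 1000)
```

A small p-value for a motif, as with the clustered positions here, would mark it as non-uniformly distributed; under the approach proposed in the abstract, that evidence would be combined with an over-representation test rather than used alone.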
We conclude that many biologically relevant motifs are heterogeneously distributed in the promoter regions of genes, and therefore that non-uniformity is a good indicator of biological relevance that can complement the commonly used over-representation tests. In this article, we present the results obtained for the S. cerevisiae data sets.