Bailey Timothy L, Noble William Stafford
Institute for Molecular Bioscience, University of Queensland, Brisbane, Australia.
Bioinformatics. 2003 Oct;19 Suppl 2:ii16-25. doi: 10.1093/bioinformatics/btg1054.
The regulatory machinery controlling gene expression is complex, frequently requiring multiple, simultaneous DNA-protein interactions. The rate at which a gene is transcribed may depend upon the presence or absence of a collection of transcription factors bound to the DNA near the gene. Locating transcription factor binding sites in genomic DNA is difficult because the individual sites are small and tend to occur frequently by chance. True binding sites may be identified by their tendency to occur in clusters, sometimes known as regulatory modules.
We describe an algorithm for detecting occurrences of regulatory modules in genomic DNA. The algorithm, called mcast, takes as input a DNA database and a collection of binding site motifs that are known to operate in concert. mcast uses a motif-based hidden Markov model with several novel features. The model incorporates motif-specific p-values, thereby allowing scores from motifs of different widths and specificities to be compared directly. The p-value scoring also allows mcast to accept only motif occurrences whose p-values fall below a user-specified threshold, while still assigning better scores to occurrences with lower p-values. mcast can search long DNA sequences, modeling the distribution of distances between motifs within a regulatory module but ignoring the distribution of distances between modules. The algorithm produces a list of predicted regulatory modules, ranked by E-value. We validate the algorithm using simulated data as well as real data sets from fruit fly and human.
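The p-value thresholding and clustering idea described above can be illustrated with a small sketch. This is not the mcast hidden Markov model: it is a simplified greedy stand-in that assumes motif hits have already been scanned and assigned p-values. The function names (`hit_score`, `find_modules`), the log-ratio score, and the parameters `max_gap` and `gap_penalty` are hypothetical choices made for illustration only, and the sketch does not compute E-values.

```python
import math
from typing import List, Tuple

Hit = Tuple[int, float]  # (start position, motif p-value); assumed input format


def hit_score(p_value: float, p_threshold: float) -> float:
    """Score a hit relative to the threshold: lower p-value, higher score.

    Assumed scoring rule for this sketch, not taken from the paper.
    """
    return -math.log10(p_value / p_threshold)


def find_modules(
    hits: List[Hit],
    p_threshold: float = 1e-3,   # hypothetical default
    max_gap: int = 200,          # hypothetical maximum distance between hits
    gap_penalty: float = 0.005,  # hypothetical per-base cost of module span
) -> List[Tuple[float, int, int, int]]:
    """Group significant hits into candidate modules and rank them.

    Returns (score, first_start, last_start, n_hits) tuples, best first.
    """
    # Accept only hits whose p-value falls below the threshold.
    kept = sorted(h for h in hits if h[1] < p_threshold)

    # Greedily chain hits whose start positions lie within max_gap.
    modules: List[List[Hit]] = []
    current: List[Hit] = []
    for hit in kept:
        if current and hit[0] - current[-1][0] > max_gap:
            modules.append(current)
            current = []
        current.append(hit)
    if current:
        modules.append(current)

    # Module score: sum of hit scores minus a penalty on the module span.
    scored = []
    for module in modules:
        total = sum(hit_score(p, p_threshold) for _, p in module)
        span = module[-1][0] - module[0][0]
        scored.append((total - gap_penalty * span, module[0][0], module[-1][0], len(module)))
    scored.sort(reverse=True)
    return scored


if __name__ == "__main__":
    example_hits = [(100, 1e-5), (150, 1e-4), (180, 0.01), (1000, 1e-6)]
    for score, start, end, n in find_modules(example_hits):
        print(f"score={score:.2f} start={start} end={end} hits={n}")
```

Because each hit is scored relative to the same p-value threshold, hits from motifs of different widths contribute on a common scale, which is the property the abstract attributes to p-value scoring. Distances within a module are penalized, while distances between modules play no role in the score.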