Stormo G D, Hartzell G W
Department of Molecular, Cellular and Developmental Biology, University of Colorado, Boulder 80309.
Proc Natl Acad Sci U S A. 1989 Feb;86(4):1183-7. doi: 10.1073/pnas.86.4.1183.
The ability to determine important features within DNA sequences from the sequences alone is becoming essential as large-scale sequencing projects are being undertaken. We present a method that can be applied to the problem of identifying the recognition pattern for a DNA-binding protein given only a collection of sequenced DNA fragments, each known to contain somewhere within it a binding site for that protein. Information about the position or orientation of the binding sites within those fragments is not needed. The method compares the "information content" of a large number of possible binding site alignments to arrive at a matrix representation of the binding site pattern. The specificity of the protein is represented as a matrix, rather than a consensus sequence, allowing patterns that are typical of regulatory protein-binding sites to be identified. The reliability of the method improves as the number of sequences increases, but the time required increases only linearly with the number of sequences. An example, using known cAMP receptor protein-binding sites, illustrates the method.
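The idea of scoring candidate alignments by their "information content" can be illustrated with a minimal sketch. This is not the authors' published algorithm. It assumes a uniform background base composition, a fixed site width, and a simple greedy strategy that adds one fragment at a time and keeps the window, on either strand, that maximizes the information content of the growing alignment. The function names (`information_content`, `candidate_sites`, `greedy_alignment`) and the toy fragments are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter

BASES = "ACGT"
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def information_content(sites, background=None):
    """Information content (bits) of equal-length aligned sites.

    Computes sum over columns i and bases b of f(b,i) * log2(f(b,i) / p(b)),
    where f is the observed frequency and p the background frequency.
    A uniform background is an assumption made for this sketch.
    """
    if background is None:
        background = {b: 0.25 for b in BASES}
    n = len(sites)
    total = 0.0
    for column in zip(*sites):
        for base, count in Counter(column).items():
            f = count / n
            total += f * math.log2(f / background[base])
    return total


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def candidate_sites(fragment, width):
    """Yield every window of the given width on both strands."""
    for strand in (fragment, reverse_complement(fragment)):
        for start in range(len(strand) - width + 1):
            yield strand[start:start + width]


def greedy_alignment(fragments, width):
    """Greedy sketch: try each window of the first fragment as a seed,
    then add from each later fragment the window that maximizes the
    information content of the alignment so far. For a fixed fragment
    length, the work grows linearly with the number of fragments.
    """
    best_sites, best_score = None, float("-inf")
    for seed in candidate_sites(fragments[0], width):
        sites = [seed]
        for fragment in fragments[1:]:
            sites.append(max(candidate_sites(fragment, width),
                             key=lambda s: information_content(sites + [s])))
        score = information_content(sites)
        if score > best_score:
            best_sites, best_score = sites, score
    return best_sites, best_score


if __name__ == "__main__":
    # Toy fragments (not real binding-site data) sharing a planted 5-mer.
    fragments = ["CCTGTGACC", "ATGTGAAT", "GGGTGTGAG"]
    sites, score = greedy_alignment(fragments, width=5)
    print(sites, score)
```

The aligned sites returned by `greedy_alignment` can be turned into a frequency matrix by counting bases per column, which is the matrix representation the abstract contrasts with a consensus sequence. A perfectly conserved column contributes 2 bits under the uniform background, and a column with all four bases equally represented contributes 0 bits.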