Lawrence C E, Reilly A A
Biometrics Laboratory, Wadsworth Center for Laboratories and Research, New York State Department of Health, Albany 12201.
Proteins. 1990;7(1):41-51. doi: 10.1002/prot.340070105.
Statistical methodology for the identification and characterization of protein binding sites in a set of unaligned DNA fragments is presented. Each sequence must contain at least one common site. No alignment of the sites is required. Instead, the uncertainty in the location of the sites is handled by employing the missing information principle to develop an "expectation maximization" (EM) algorithm. This approach allows for the simultaneous identification of the sites and characterization of the binding motifs. The reliability of the algorithm increases with the number of fragments, but the computations increase only linearly. The method is illustrated with an example, using known cyclic adenosine monophosphate receptor protein (CRP) binding sites. The final motif is utilized in a search for undiscovered CRP binding sites.
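The EM idea described in the abstract can be sketched as follows. This is a minimal illustration and not the authors' implementation. It assumes exactly one site of a fixed, known width per sequence, a uniform prior over start positions, and a background model estimated from overall base frequencies. The function name `em_motif` and its parameters (`width`, `n_iter`, `pseudocount`, `seed`) are hypothetical. The site start positions play the role of the missing information: the E-step computes a posterior over start positions in each fragment, and the M-step re-estimates the motif matrix from the posterior-weighted base counts.

```python
import numpy as np

ALPHABET = "ACGT"
IDX = {c: i for i, c in enumerate(ALPHABET)}


def em_motif(seqs, width, n_iter=50, pseudocount=0.1, seed=0):
    """EM sketch for a one-site-per-sequence motif model (illustrative assumption).

    Returns the estimated motif matrix (width x 4) and, for each sequence,
    the posterior probability of the site starting at each position.
    Every sequence must be at least `width` bases long.
    """
    rng = np.random.default_rng(seed)
    enc = [np.array([IDX[c] for c in s]) for s in seqs]

    # Background base frequencies, estimated once from all fragments.
    bg_counts = np.bincount(np.concatenate(enc), minlength=4) + pseudocount
    bg = bg_counts / bg_counts.sum()

    # Random starting motif: each row is a distribution over A, C, G, T.
    pwm = rng.dirichlet(np.ones(4), size=width)
    cols = np.arange(width)
    posteriors = []

    for _ in range(n_iter):
        new_counts = np.full((width, 4), pseudocount)
        posteriors = []
        for x in enc:
            # All candidate windows, shape (n_positions, width).
            windows = np.lib.stride_tricks.sliding_window_view(x, width)

            # E-step: motif-vs-background log likelihood ratio per start,
            # normalized into a posterior over start positions.
            log_lr = np.log(pwm[cols, windows] / bg[windows]).sum(axis=1)
            z = np.exp(log_lr - log_lr.max())
            z /= z.sum()
            posteriors.append(z)

            # M-step accumulation: posterior-weighted base counts per column.
            for j in range(width):
                new_counts[j] += np.bincount(windows[:, j], weights=z, minlength=4)

        pwm = new_counts / new_counts.sum(axis=1, keepdims=True)

    return pwm, posteriors
```

Each iteration visits every candidate start position in every fragment once, so the work per iteration grows linearly with the number of fragments, in line with the abstract's remark on computational cost. EM converges only to a local optimum, so in practice several random starts (different `seed` values) would normally be compared.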