Frech K, Herrmann G, Werner T
Institut für Säugetiergenetik, GSF-Forschungszentrum für Umwelt und Gesundheit mbH, Neuherberg, Germany.
Nucleic Acids Res. 1993 Apr 11;21(7):1655-64. doi: 10.1093/nar/21.7.1655.
We present a method for determining the location and extent of protein binding regions in nucleic acids by computer-assisted analysis of sequence data. The program ConsIndex establishes a library of consensus descriptions from sequence sets containing known regulatory elements. The program ConsInspector then uses these consensus descriptions to predict binding sites in new sequences. We show that the programs correctly determine the significant regions involved in transcriptional control for seven sequence elements. The internal profile of relative variability at individual nucleotide positions within these regions paralleled experimental profiles of biological significance. Consensus descriptions are determined with an anchored alignment scheme, whose results are then evaluated by a novel method that is superior to cluster algorithms. The alignment procedure can include several closely related sequences without biasing the consensus description. Moreover, the algorithm detects additional elements on the basis of a moderate distance correlation and can discriminate between real binding sites and false positive matches. The software is well suited to the frequent case of optional elements that are present in only a subset of functionally similar sequences, while taking maximal advantage of the existing sequence database. Since it requires a minimum of only seven sequences for a single element, it is applicable to a wide range of binding sites.
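The general idea of deriving a consensus description with a per-position variability profile, and then using it to scan new sequences, can be illustrated with a minimal sketch. This is not the ConsIndex or ConsInspector algorithm: the abstract does not specify the anchored alignment scheme, the evaluation method, or the scoring function. The sketch assumes the sites are already aligned and of equal length, swaps in Shannon entropy as the variability measure, and uses a conservation-weighted match score with an arbitrary threshold. The function names, the seven toy sequences, and the threshold value are all hypothetical.

```python
import math
from collections import Counter

def position_conservation(aligned):
    """Per-column conservation in [0, 1] for pre-aligned, equal-length DNA sites.

    Assumption: conservation = 1 - (Shannon entropy / 2 bits). This is an
    illustrative measure, not the one used by ConsIndex.
    """
    length = len(aligned[0])
    if any(len(s) != length for s in aligned):
        raise ValueError("sequences must be pre-aligned to equal length")
    scores = []
    for col in range(length):
        counts = Counter(s[col] for s in aligned)
        total = sum(counts.values())
        entropy = -sum((n / total) * math.log2(n / total) for n in counts.values())
        scores.append(1.0 - entropy / 2.0)  # 2 bits is the maximum for 4 bases
    return scores

def consensus(aligned):
    """Most frequent base in each column of the aligned sites."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*aligned))

def scan(sequence, aligned, threshold=0.8):
    """Return (start, score) for windows whose weighted match score >= threshold.

    Assumption: a window's score is the summed conservation of the positions
    that match the consensus, divided by the total conservation.
    """
    cons = consensus(aligned)
    weights = position_conservation(aligned)
    total = sum(weights)
    hits = []
    for start in range(len(sequence) - len(cons) + 1):
        window = sequence[start:start + len(cons)]
        matched = sum(w for w, a, b in zip(weights, window, cons) if a == b)
        score = matched / total
        if score >= threshold:
            hits.append((start, round(score, 3)))
    return hits

# Hypothetical toy set of seven aligned sites (invented for illustration).
sites = ["TATAAA", "TATAAA", "TATATA", "TATAAA", "TATAAG", "TATAAA", "TATATA"]
profile = position_conservation(sites)
hits = scan("GGTATAAAGG", sites)
```

In this toy set the first four columns are invariant and the last two vary, so `profile` is lower at those positions and mismatches there cost less during scanning. Weighting by conservation is one simple way to let variable positions tolerate mismatches; the published method may weight positions quite differently.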