Identifying DNA and protein patterns with statistically significant alignments of multiple sequences.

Author information

Hertz G Z, Stormo G D

Affiliation

Department of Molecular, Cellular and Developmental Biology, University of Colorado, Boulder, CO 80309-0347, USA.

Publication information

Bioinformatics. 1999 Jul-Aug;15(7-8):563-77. doi: 10.1093/bioinformatics/15.7.563.

DOI:10.1093/bioinformatics/15.7.563

PMID:10487864

Abstract

MOTIVATION

Molecular biologists frequently can obtain interesting insight by aligning a set of related DNA, RNA or protein sequences. Such alignments can be used to determine either evolutionary or functional relationships. Our interest is in identifying functional relationships. Unless the sequences are very similar, it is necessary to have a specific strategy for measuring, or scoring, the relatedness of the aligned sequences. If the alignment is not known, one can be determined by finding an alignment that optimizes the scoring scheme.

RESULTS

We describe four components to our approach for determining alignments of multiple sequences. First, we review a log-likelihood scoring scheme we call information content. Second, we describe two methods for estimating the P value of an individual information content score: (i) a method that combines a technique from large-deviation statistics with numerical calculations; (ii) a method that is exclusively numerical. Third, we describe how we count the number of possible alignments given the overall amount of sequence data. This count is multiplied by the P value to determine the expected frequency of an information content score and, thus, the statistical significance of the corresponding alignment. Statistical significance can be used to compare alignments having differing widths and containing differing numbers of sequences. Fourth, we describe a greedy algorithm for determining alignments of functionally related sequences. Finally, we test the accuracy of our P value calculations, and give an example of using our algorithm to identify binding sites for the Escherichia coli CRP protein.
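As an illustration of the first and third components, the following is a minimal sketch, not the authors' implementation. It assumes the commonly used form of information content, the sum over alignment columns of f * ln(f / p), where f is the observed frequency of a letter in a column and p is its a priori probability. The alignment count shown here (one ungapped site per sequence) is a simplifying assumption and may differ from the counting scheme in the paper. All function names are hypothetical, and the P value is taken as an input rather than computed.

```python
import math
from collections import Counter


def information_content(sites, background):
    """Information content (in nats) of an ungapped alignment.

    sites: equal-length strings, one aligned site per sequence.
    background: dict mapping each letter to its a priori probability.
    Assumed form: sum over columns and letters of f * ln(f / p).
    """
    width = len(sites[0])
    if any(len(site) != width for site in sites):
        raise ValueError("all aligned sites must have the same width")
    n_sites = len(sites)
    total = 0.0
    for col in range(width):
        counts = Counter(site[col] for site in sites)
        for letter, count in counts.items():
            freq = count / n_sites
            total += freq * math.log(freq / background[letter])
    return total


def count_alignments(seq_lengths, width):
    """Number of possible alignments under a simplifying assumption:
    exactly one ungapped site of the given width in each sequence."""
    n_alignments = 1
    for length in seq_lengths:
        n_alignments *= max(length - width + 1, 0)
    return n_alignments


def expected_frequency(p_value, n_alignments):
    """Expected number of alignments scoring at least as well by chance:
    the P value multiplied by the number of possible alignments."""
    return p_value * n_alignments


# Hypothetical usage with a uniform DNA background.
uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
score = information_content(["ACGT", "ACGT", "ACGA"], uniform)
n_possible = count_alignments([100, 100, 100], width=4)
```

Because the expected frequency already accounts for the number of possible alignments, it can be compared across alignments of different widths and different numbers of sequences, which the raw information content score cannot.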

AVAILABILITY

Programs were developed under the UNIX operating system and are available by anonymous ftp from ftp://beagle.colorado.edu/pub/consensus.
