Tatusov R L, Altschul S F, Koonin E V
National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894.
Proc Natl Acad Sci U S A. 1994 Dec 6;91(25):12091-5. doi: 10.1073/pnas.91.25.12091.
We describe an approach to analyzing protein sequence databases that, starting from a single uncharacterized sequence or group of related sequences, generates blocks of conserved segments. The procedure involves iterative database scans with an evolving position-dependent weight matrix constructed from a coevolving set of aligned conserved segments. For each iteration, the expected distribution of matrix scores under a random model is used to set a cutoff score for the inclusion of a segment in the next iteration. This cutoff may be calculated to allow the chance inclusion of either a fixed number or a fixed proportion of false positive segments. With sufficiently high cutoff scores, the procedure converged for all alignment blocks studied, with varying numbers of iterations required. Different methods for calculating weight matrices from alignment blocks were compared. The most effective of those tested was a logarithm-of-odds, Bayesian-based approach that used prior residue probabilities calculated from a mixture of Dirichlet distributions. The procedure described was used to detect novel conserved motifs of potential biological importance.
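The loop described above (build a weight matrix from the current block, derive a cutoff from the random-score distribution, rescan, repeat until the block stops changing) can be sketched in a few lines. This is a minimal illustration and not the authors' implementation: a single Dirichlet prior proportional to the background frequencies stands in for the mixture of Dirichlet distributions, only the "fixed number of false positives" cutoff is shown, and every function name, the integer score scale, and the default parameters are assumptions made for the example.

```python
import math
from collections import Counter, defaultdict

SCALE = 100  # assumption: log-odds scores stored as integers in 1/100 bit


def build_pwm(block, background, pseudocount=1.0):
    """Log-odds weight matrix from equal-length aligned segments.

    Simplification: one Dirichlet prior (pseudocounts proportional to the
    background) replaces the mixture of Dirichlet priors.
    """
    total = len(block) + pseudocount
    pwm = []
    for col in range(len(block[0])):
        counts = Counter(seg[col] for seg in block)
        pwm.append({
            res: round(SCALE * math.log2((counts[res] + pseudocount * bg) / total / bg))
            for res, bg in background.items()
        })
    return pwm


def score_segment(pwm, segment):
    return sum(column[res] for column, res in zip(pwm, segment))


def score_distribution(pwm, background):
    """Exact score distribution for random segments, by convolving columns."""
    dist = {0: 1.0}
    for column in pwm:
        nxt = defaultdict(float)
        for score, prob in dist.items():
            for res, bg in background.items():
                nxt[score + column[res]] += prob * bg
        dist = nxt
    return dict(dist)


def cutoff_for_expected_fp(dist, n_segments, expected_fp):
    """Lowest score whose tail probability times n_segments stays <= expected_fp."""
    tail, cutoff = 0.0, math.inf  # inf: no score is rare enough
    for score in sorted(dist, reverse=True):
        if (tail + dist[score]) * n_segments > expected_fp:
            break
        tail += dist[score]
        cutoff = score
    return cutoff


def scan(pwm, sequences, cutoff):
    """All database segments scoring at or above the cutoff."""
    width = len(pwm)
    return [seq[i:i + width]
            for seq in sequences
            for i in range(len(seq) - width + 1)
            if score_segment(pwm, seq[i:i + width]) >= cutoff]


def iterate(seed_block, sequences, background, expected_fp=0.01, max_iter=10):
    """Repeat matrix construction and scanning until the block is stable."""
    block = list(seed_block)
    width = len(block[0])
    n_segments = sum(max(len(s) - width + 1, 0) for s in sequences)
    for _ in range(max_iter):
        pwm = build_pwm(block, background)
        dist = score_distribution(pwm, background)
        cutoff = cutoff_for_expected_fp(dist, n_segments, expected_fp)
        new_block = scan(pwm, sequences, cutoff)
        if not new_block or new_block == block:
            break  # converged, or nothing passed the cutoff
        block = new_block
    return block
```

For example, `iterate(["ACD"], ["MKACDLL", "GGACDWW", "PPPPPPP"], background)` with a uniform 20-residue `background` dictionary would start from one segment and rescan the three toy sequences. Raising `expected_fp` loosens the cutoff and admits more segments per iteration; the abstract reports convergence only for sufficiently high cutoff scores.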