Zheng Wei-Mou
Institute of Theoretical Physics, Academia Sinica, Beijing 100080, China.
Bioinformatics. 2005 Apr 1;21(7):938-43. doi: 10.1093/bioinformatics/bti090. Epub 2004 Oct 28.
The discovery of patterns shared by several sequences that differ greatly is a basic task in sequence analysis, and still a challenge. Several methods have been developed for detecting patterns. Methods commonly used for motif search include the Gibbs sampler, the Expectation-Maximization (EM) algorithm, and some intuitive greedy approaches. One cannot guarantee the optimality of the result produced by the Gibbs sampler in a single run. The deterministic EM methods tend to get trapped in local optima. Solutions found by greedy approaches are rarely sufficiently good.
A simple model describing a motif or a portion of a local multiple sequence alignment is the weight matrix model, in which a motif is characterized by position-specific probabilities. Two substitution matrices are proposed to relate sequence similarity to the weight matrix. Combining the substitution matrix and the weight matrix, we examine three typical sets of protein sequences of increasing complexity. At a low score threshold for pair similarity, sliding windows are compared with a seed window to find the score sum, which provides a measure of statistical significance for multiple sequence comparison. Such a similarity analysis reveals many aspects of motifs. Blocks determined by similarity can be used to deduce a primary weight matrix or an improved substitution matrix. The algorithm successfully obtains the optimal solution for the test sets using only greedy iteration.
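The seed-window comparison and the weight matrix described above can be illustrated with a minimal Python sketch. It is not the paper's implementation. The function names, the identity-based `toy_subst` score, the pseudocount, and the rule of keeping only the best window per sequence are all assumptions made for illustration; the abstract does not specify the two proposed substitution matrices or how the score sum is turned into a significance measure.

```python
ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def toy_subst(a, b):
    # Assumed stand-in for a substitution matrix: 1 for identity, 0 otherwise.
    return 1 if a == b else 0


def window_score(seed, window, subst=toy_subst):
    # Pair similarity between the seed window and one equal-length window.
    return sum(subst(a, b) for a, b in zip(seed, window))


def scan(seed, sequences, threshold, subst=toy_subst):
    """Slide a window over each sequence and keep its best match to the seed.

    Returns (score_sum, hits), where hits lists (sequence index, offset, score)
    for every sequence whose best window reaches the threshold.
    """
    width = len(seed)
    hits = []
    score_sum = 0
    for n, seq in enumerate(sequences):
        offsets = range(len(seq) - width + 1)
        if not offsets:
            continue  # sequence shorter than the seed window
        # max() with a key returns the first offset that attains the best score.
        best = max(offsets, key=lambda i: window_score(seed, seq[i:i + width], subst))
        score = window_score(seed, seq[best:best + width], subst)
        if score >= threshold:
            hits.append((n, best, score))
            score_sum += score
    return score_sum, hits


def weight_matrix(block, pseudocount=1.0):
    """Position-specific probabilities from a block of equal-length segments."""
    width = len(block[0])
    total = len(block) + pseudocount * len(ALPHABET)
    matrix = []
    for pos in range(width):
        column = [segment[pos] for segment in block]
        matrix.append(
            {a: (column.count(a) + pseudocount) / total for a in ALPHABET}
        )
    return matrix


if __name__ == "__main__":
    sequences = ["AAKLMWPQ", "GGKLMWTT", "CCCKAMWC"]  # made-up toy data
    seed = "KLMW"
    score_sum, hits = scan(seed, sequences, threshold=3)
    block = [sequences[n][i:i + len(seed)] for n, i, _ in hits]
    matrix = weight_matrix(block)
    print(score_sum, hits)
    print(block, round(matrix[1]["L"], 3))
```

In this sketch the block collected by `scan` feeds `weight_matrix`, which mirrors the statement that blocks determined by similarity can be used to deduce a primary weight matrix. A greedy iteration could then rescan the sequences with the new matrix in place of the seed, but that step is left out because the abstract does not describe its details.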