iGibbs: improving Gibbs motif sampler for proteins by sequence clustering and iterative pattern sampling.

Author information

Kim Sun, Wang Zhiping, Dalkilic Mehmet

Affiliation

School of Informatics, Indiana University, Indiana 47408, USA.

Publication information

Proteins. 2007 Feb 15;66(3):671-81. doi: 10.1002/prot.21153.

DOI:10.1002/prot.21153

PMID:17120229

Abstract

The motif prediction problem is to predict short, conserved subsequences shared by a family of sequences, and it is an important problem in biology. Gibbs is one of the first successful motif-finding algorithms; it runs very fast compared with other algorithms, and its search behavior is based on the well-studied Gibbs random sampling. However, motif prediction is difficult, and Gibbs may fail to predict the true motifs in some cases. The authors therefore explored the possibility of improving the prediction accuracy of Gibbs while retaining its fast runtime. In this paper, the authors considered Gibbs only for proteins, not for DNA binding sites. They developed iGibbs, an integrated motif search framework for proteins that employs two of their own earlier techniques: one guides motif search by clustering sequences, and the other by pattern refinement. The two techniques are combined into a new double clustering approach to guiding motif search. The unique feature of the framework is that users do not have to specify the number of motifs to be predicted when motifs occur in different subsets of the input sequences, because it automatically groups the input sequences into clusters and predicts motifs from those clusters. Tests on the PROSITE database show that the framework significantly improved the prediction accuracy of Gibbs. Compared with more exhaustive search methods such as MEME, iGibbs predicted motifs more accurately and ran one order of magnitude faster.
