Sun Yanni, Buhler Jeremy
Department of Computer Science and Engineering, Washington University, St Louis, MO 63130-4899, USA.
IEEE/ACM Trans Comput Biol Bioinform. 2009 Apr-Jun;6(2):232-43. doi: 10.1109/TCBB.2008.14.
Profile HMMs are powerful tools for modeling conserved motifs in proteins. They are widely used by search tools to classify new protein sequences into families based on domain architecture. However, the proliferation of known motifs and new proteomic sequence data poses a computational challenge for search, requiring days of CPU time to annotate an organism's proteome. It is highly desirable to speed up HMM search in large databases. We design PROSITE-like patterns and short profiles that are used as filters to rapidly eliminate protein-motif pairs for which a full profile HMM comparison does not yield a significant match. The design of the pattern-based filters is formulated as a multichoice knapsack problem. Profile-based filters with high sensitivity are extracted from a profile HMM based on their theoretical sensitivity and false positive rate. Experiments show that our profile-based filters achieve high sensitivity (near 100 percent) while keeping around a 20-fold speedup with respect to the unfiltered search program. Pattern-based filters typically retain at least 90 percent of the sensitivity of the source HMM with a 30- to 40-fold speedup. The profile-based filters have sensitivity comparable to the multistage filtering strategy HMMERHEAD [15] and are faster in most of our experiments.
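The abstract does not spell out how the pattern design maps onto a multichoice knapsack problem, so the following is only a minimal sketch of the generic problem, not the authors' formulation. It assumes each class (hypothetically, one pattern position) offers several candidate items, each with a non-negative integer weight (hypothetically, a discretized false-positive cost) and a value (hypothetically, a sensitivity score), and that exactly one item must be chosen per class without exceeding a total weight budget. The function name `multichoice_knapsack` and the example numbers are illustrative assumptions.

```python
from typing import List, Optional, Tuple

Item = Tuple[int, float]  # (non-negative integer weight, value); illustrative encoding


def multichoice_knapsack(
    classes: List[List[Item]], capacity: int
) -> Optional[Tuple[float, List[int]]]:
    """Choose exactly one item per class so that total weight <= capacity
    and total value is maximal. Returns (best value, chosen index per class),
    or None when no feasible selection exists."""
    neg_inf = float("-inf")
    # best[w] = best value reachable with total weight exactly w so far
    best = [neg_inf] * (capacity + 1)
    best[0] = 0.0
    backpointers = []  # one table per class: weight -> (item index, previous weight)

    for items in classes:
        nxt = [neg_inf] * (capacity + 1)
        back = [None] * (capacity + 1)
        for w in range(capacity + 1):
            if best[w] == neg_inf:
                continue  # weight w is unreachable with the classes seen so far
            for idx, (wt, val) in enumerate(items):
                nw = w + wt
                if nw <= capacity and best[w] + val > nxt[nw]:
                    nxt[nw] = best[w] + val
                    back[nw] = (idx, w)
        best = nxt
        backpointers.append(back)

    end = max(range(capacity + 1), key=lambda w: best[w])
    if best[end] == neg_inf:
        return None  # even the lightest item from every class exceeds the budget

    # Walk the backpointers from the last class to the first
    picks = []
    w = end
    for back in reversed(backpointers):
        idx, w = back[w]
        picks.append(idx)
    picks.reverse()
    return best[end], picks


# Hypothetical example: three classes with two candidate items each
example = [
    [(1, 1.0), (3, 4.0)],
    [(2, 2.0), (4, 5.5)],
    [(1, 0.5), (2, 3.0)],
]
result = multichoice_knapsack(example, capacity=6)
```

The dynamic program runs in time proportional to the capacity times the total number of items, which is why the weights are assumed to be small integers; real-valued costs would need to be discretized first.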