Hsu Chen-Ming, Chen Chien-Yu, Liu Baw-Jhiune
Department of Bio-Industrial Mechatronics Engineering, National Taiwan University, Taipei, 106, Taiwan.
Algorithms Mol Biol. 2011 Mar 31;6(1):6. doi: 10.1186/1748-7188-6-6.
Automatic extraction of motifs from biological sequences is an important research problem in the study of molecular biology. For proteins, it is desirable to discover sequence motifs containing a large number of wildcard symbols, because the residues associated with functional sites are usually widely separated in the sequence. Discovering such patterns is time-consuming because the number of possible combinations grows rapidly when long gaps are considered (a gap consists of one or more successive wildcards). Mining algorithms often employ constraints to narrow the search space and increase efficiency. However, improper constraint models might degrade the sensitivity and specificity of the motifs discovered by computational methods. We previously proposed a new constraint model to handle large wildcard regions for discovering functional motifs of proteins. The patterns that satisfy the proposed constraint model are called W-patterns. A W-pattern is a structured motif that groups motif symbols into pattern blocks interleaved with large irregular gaps. Allowing large gaps reflects the fact that functional residues do not always come from a single region of a protein sequence, and restricting motif symbols to clusters corresponds to the observation that short motifs are frequently present within protein families. To efficiently discover W-patterns for large-scale sequence annotation and function prediction, this paper first formally introduces the problem to be solved and then proposes an algorithm named WildSpan (sequential pattern mining across large wildcard regions) that incorporates several pruning strategies to substantially reduce the mining cost.
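To make the structure of a W-pattern concrete, the following is a minimal sketch of how a pattern made of blocks separated by variable-length gaps could be matched against sequences. It is an illustration only, not the WildSpan algorithm: the representation (a list of block strings with a lowercase `x` as an intra-block wildcard, plus a list of `(min, max)` gap ranges), the function names `w_pattern_to_regex` and `support`, and the example pattern are all assumptions introduced here.

```python
import re


def w_pattern_to_regex(blocks, gaps):
    """Compile a block/gap pattern into a regular expression.

    blocks: list of strings; a lowercase 'x' stands for one wildcard
            position inside a block (hypothetical notation).
    gaps:   list of (min_len, max_len) tuples, one per pair of
            consecutive blocks, bounding the wildcard run between them.
    """
    if len(gaps) != len(blocks) - 1:
        raise ValueError("need exactly one gap range between consecutive blocks")
    parts = []
    for i, block in enumerate(blocks):
        # Inside a block, 'x' matches any single residue.
        parts.append("".join("." if c == "x" else re.escape(c) for c in block))
        if i < len(gaps):
            lo, hi = gaps[i]
            # Between blocks, allow a gap of lo..hi arbitrary residues.
            parts.append(".{%d,%d}" % (lo, hi))
    return re.compile("".join(parts))


def support(pattern, sequences):
    """Fraction of sequences that contain at least one match."""
    if not sequences:
        return 0.0
    hits = sum(1 for seq in sequences if pattern.search(seq))
    return hits / len(sequences)


if __name__ == "__main__":
    # Two short blocks separated by a gap of 5 to 20 residues (made-up example).
    pat = w_pattern_to_regex(["CxxC", "HxH"], [(5, 20)])
    toy_sequences = ["AACAACGGGGGGGHAHKK", "CAACHAH"]
    print(pat.pattern, support(pat, toy_sequences))
```

Matching a known pattern this way is cheap. The expensive part that the abstract refers to is enumerating candidate block and gap combinations in the first place, which is where constraints and pruning matter.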
WildSpan is shown to efficiently find W-patterns containing conserved residues that are far apart in the sequence. We conducted experiments with two mining strategies, protein-based and family-based mining, to evaluate the usefulness of W-patterns and the performance of WildSpan. The protein-based mining mode of WildSpan is designed to discover functional regions of a single protein by referring to a set of related sequences (e.g. its homologues). The discovered W-patterns are used to characterize the protein sequence, and the results are compared with the conserved positions identified by multiple sequence alignment (MSA). The family-based mining mode of WildSpan is designed to extract sequence signatures for a group of related proteins (e.g. a protein family) for protein function classification. In this setting, the discovered W-patterns are compared with PROSITE patterns as well as with the patterns generated by three existing methods that perform a similar task. Finally, analysis of the execution time of WildSpan reveals that the proposed pruning strategy is effective in improving the scalability of the algorithm.
The mining results of this study reveal that WildSpan is efficient and effective in discovering functional signatures of proteins directly from sequences. The proposed pruning strategy is effective in improving the scalability of WildSpan. The study also demonstrates that the W-patterns discovered by WildSpan provide useful information for characterizing protein sequences. The WildSpan executable and open-source code are available on the web (http://biominer.csie.cyu.edu.tw/wildspan).