Institute of Biomedical Informatics, National Yang-Ming University, Taipei, Taiwan.
PLoS One. 2011;6(5):e20025. doi: 10.1371/journal.pone.0020025. Epub 2011 May 25.
The rapid expansion of proteomic sequence databases, together with a wealth of new information on phosphorylation sites, has made efficient algorithms for uncovering biologically relevant phosphorylation motifs increasingly important. Here we present a novel unsupervised method, called Motif Finder (in short, F-Motif), for identifying phosphorylation motifs. F-Motif clusters sequences represented by numerical features that exploit the statistical information hidden in the foreground data. The identified motifs are then filtered to retain only the "actual" motifs, those with statistically significant motif scores.
We applied F-Motif to several new and existing data sets and compared its performance with two well-known state-of-the-art methods. In almost all cases, F-Motif identified all statistically significant motifs extracted by those methods. More importantly, F-Motif also uncovered several novel motifs. Using clues from the literature, we show that most of these newly discovered motifs are indeed novel. We also observed some interesting phenomena. For example, for CK2 kinase, the conserved sites appear only on the right side of S, whereas for CDK kinase, the site immediately to the right of S is conserved as residue P. In addition, three different encoding methods were used, including a novel position contrast matrix (PCM) and the simplest binary coding, and the ability of F-Motif to discover motifs remained quite robust across encoding schemes.
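As a rough illustration of the simplest of these encodings, the sketch below one-hot ("binary") encodes peptide windows centered on a phosphorylated S and tabulates residue frequencies at the +1 position. It is a minimal sketch under stated assumptions, not the F-Motif implementation: the function names, the 7-residue window length, and the toy sequences are all hypothetical and chosen only to mirror the CDK observation that P is conserved immediately to the right of S.

```python
# Hypothetical helper names and toy data; not taken from F-Motif itself.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def binary_encode(window):
    """One-hot encode a peptide window: 20 bits per position."""
    vec = []
    for residue in window:
        bits = [0] * len(AMINO_ACIDS)
        idx = AMINO_ACIDS.find(residue)
        if idx >= 0:  # unknown symbols stay all-zero
            bits[idx] = 1
        vec.extend(bits)
    return vec


def position_frequency(windows, offset, center):
    """Fraction of each residue at a given offset from the central site."""
    counts = {}
    for w in windows:
        residue = w[center + offset]
        counts[residue] = counts.get(residue, 0) + 1
    total = len(windows)
    return {residue: n / total for residue, n in counts.items()}


# Made-up 7-residue windows with the phosphorylated S at index 3.
toy_windows = ["AKLSPRK", "GTESPVK", "RRASPEK", "LLDSAKE"]

encoded = [binary_encode(w) for w in toy_windows]
freq_plus1 = position_frequency(toy_windows, offset=1, center=3)
print(len(encoded[0]), freq_plus1)
```

Vectors of this kind could be passed to any clustering routine; the abstract does not specify which clustering algorithm F-Motif uses, so none is assumed here.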
The iterative algorithm proposed here uses exploratory data analysis to discover motifs from phosphorylation data. The effectiveness of F-Motif has been demonstrated on several real data sets as well as on a synthetic data set. The method is quite general in nature and can also be used to find other types of motifs. An F-Motif server is available at http://f-motif.classcloud.org/, http://bio.classcloud.org/f-motif/ or http://ymu.classcloud.org/f-motif/.