Korhonen Janne H, Palin Kimmo, Taipale Jussi, Ukkonen Esko
School of Computer Science, Reykjavík University, Reykjavík, Iceland.
Helsinki Institute for Information Technology HIIT, Helsinki, Finland.
Bioinformatics. 2017 Feb 15;33(4):514-521. doi: 10.1093/bioinformatics/btw683.
While the position weight matrix (PWM) is the most popular model for sequence motifs, there is growing evidence of the usefulness of more advanced models such as first-order Markov representations, and such models are also becoming available in well-known motif databases. There has been much research on how to learn these models from training data, but the problem of predicting putative sites of the learned motifs by matching the model against new sequences has received less attention. Moreover, motif site analysis is often concerned with how different variants in the sequence affect the sites. So far, though, efficient software tools for these motif matching tasks have been lacking.
We develop fast motif matching algorithms for the aforementioned tasks. First, we formalize a framework based on high-order position weight matrices for generic representation of motif models with dinucleotide or general q-mer dependencies, and adapt fast PWM matching algorithms to the high-order PWM framework. Second, we show how to incorporate different types of sequence variants, such as SNPs and indels, and their combined effects into efficient PWM matching workflows. Benchmark results show that our algorithms perform well in practice on genome-sized sequence sets and, for multiple motif search, are much faster than the basic sliding window algorithm.
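To make the high-order PWM idea concrete, the following is a minimal sketch of the basic sliding window baseline for a matrix with q-mer dependencies. It is not the MOODS API and not the fast algorithm developed here. The matrix layout (one row per q-mer, one column per q-mer start position inside the motif), the names `qmer_index`, `score_window`, `sliding_window_matches` and `toy_matrix`, and the toy integer scores are all assumptions made for illustration.

```python
from itertools import product

ALPHABET = "ACGT"


def qmer_index(qmer):
    """Map a q-mer over ACGT to a row index (base-4 encoding, A=0 ... T=3)."""
    idx = 0
    for ch in qmer:
        idx = idx * 4 + ALPHABET.index(ch)
    return idx


def score_window(matrix, q, window):
    """Score one window: sum the entry of each overlapping q-mer at its start column.

    Assumed layout: matrix[row][col], with 4**q rows and
    len(window) - q + 1 columns. With q = 1 this is an ordinary PWM score.
    """
    return sum(
        matrix[qmer_index(window[i:i + q])][i]
        for i in range(len(window) - q + 1)
    )


def sliding_window_matches(matrix, q, motif_len, seq, threshold):
    """Naive scan: score every window and keep those reaching the threshold."""
    hits = []
    for pos in range(len(seq) - motif_len + 1):
        window = seq[pos:pos + motif_len]
        # Skip windows with ambiguous symbols such as N.
        if any(ch not in ALPHABET for ch in window):
            continue
        score = score_window(matrix, q, window)
        if score >= threshold:
            hits.append((pos, score))
    return hits


def toy_matrix():
    """Hypothetical dinucleotide (q = 2) matrix for a motif of length 3.

    Every entry is -1 except AC at column 0 and CG at column 1,
    so the window ACG is the only one with a positive score.
    """
    matrix = [[-1, -1] for _ in product(ALPHABET, repeat=2)]
    matrix[qmer_index("AC")][0] = 2
    matrix[qmer_index("CG")][1] = 3
    return matrix


if __name__ == "__main__":
    print(sliding_window_matches(toy_matrix(), 2, 3, "TTACGACGT", 4))
```

This scan costs on the order of the motif length per sequence position for each motif, which is the baseline cost that faster matching algorithms aim to reduce when many motifs are searched at once.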
Implementations are available as part of the MOODS software package under the GNU General Public License v3.0 and the Biopython license (http://www.cs.helsinki.fi/group/pssmfind).