Czeizler Elena, Hirvola Tommi, Karhu Kalle
IEEE/ACM Trans Comput Biol Bioinform. 2017 Jan-Feb;14(1):121-130. doi: 10.1109/TCBB.2015.2511750. Epub 2015 Dec 23.
Motif recognition is a challenging problem in bioinformatics due to the diversity of protein motifs. Many existing algorithms identify motifs of a given length, thus being either not applicable or not efficient when searching simultaneously for motifs of various lengths. Searching for gapped motifs, although very important, is a highly time-consuming task due to the combinatorial explosion of possible combinations implied by the consideration of long gaps. We introduce a new graph theoretical approach to identify motifs of various lengths, both with and without gaps. We compare our approach with two widely used methods: MEME and GLAM2 analyzing both the quality of the results and the required computational time. Our method provides results of a slightly higher level of quality than MEME but at a much faster rate, i.e., one eighth of MEME's query time. By using similarity indexing, we drop the query times down to an average of approximately one sixth of the ones required by GLAM2, while achieving a slightly higher level of quality of the results. More precisely, for sequence collections smaller than 50000 bytes GLAM2 is 13 times slower, while being at least as fast as our method on larger ones. The source code of our C++ implementation is freely available in GitHub: https://github.com/hirvolt1/debruijn-motif.
由于蛋白质基序的多样性,基序识别在生物信息学中是一个具有挑战性的问题。许多现有算法只能识别给定长度的基序,因此在同时搜索各种长度的基序时,要么不适用,要么效率不高。搜索带间隙的基序虽然非常重要,但由于考虑长间隙时可能组合的组合爆炸,这是一项非常耗时的任务。我们引入了一种新的图论方法来识别各种长度的基序,包括有间隙和无间隙的情况。我们将我们的方法与两种广泛使用的方法进行比较:MEME和GLAM2,同时分析结果的质量和所需的计算时间。我们的方法提供的结果质量略高于MEME,但速度要快得多,即MEME查询时间的八分之一。通过使用相似性索引,我们将查询时间降低到平均约为GLAM2所需时间的六分之一,同时实现略高的结果质量。更准确地说,对于小于50000字节的序列集合,GLAM2的速度慢13倍,而在较大的集合上至少与我们的方法一样快。我们C++实现的源代码可在GitHub上免费获取:https://github.com/hirvolt1/debruijn-motif 。