Nicolae Marius, Rajasekaran Sanguthevar
Department of Computer Science and Engineering University of Connecticut, Storrs, CT, USA.
Sci Rep. 2015 Jan 15;5:7813. doi: 10.1038/srep07813.
Discovering patterns in biological sequences is a crucial problem. For example, the identification of patterns in DNA sequences has resulted in the determination of open reading frames, identification of gene promoter elements, intron/exon splicing sites, and SH RNAs, location of RNA degradation signals, identification of alternative splicing sites, etc. In protein sequences, patterns have led to domain identification, location of protease cleavage sites, identification of signal peptides, protein interactions, determination of protein degradation elements, identification of protein trafficking elements, discovery of short functional motifs, etc. In this paper we focus on the identification of an important class of patterns, namely, motifs. We study the (ℓ, d) motif search problem or Planted Motif Search (PMS). PMS receives as input n strings and two integers ℓ and d. It returns all sequences M of length ℓ that occur in each input string, where each occurrence differs from M in at most d positions. Another formulation is quorum PMS (qPMS), where the motif appears in at least q% of the strings. We introduce qPMS9, a parallel exact qPMS algorithm that offers significant runtime improvements on DNA and protein datasets. qPMS9 solves the challenging DNA (ℓ, d)-instances (28, 12) and (30, 13). The source code is available at https://code.google.com/p/qpms9/.
在生物序列中发现模式是一个关键问题。例如,对DNA序列模式的识别已导致开放阅读框的确定、基因启动子元件的识别、内含子/外显子剪接位点和SH RNA的识别、RNA降解信号的定位、可变剪接位点的识别等。在蛋白质序列中,模式已导致结构域识别、蛋白酶切割位点的定位、信号肽的识别、蛋白质相互作用、蛋白质降解元件的确定、蛋白质转运元件的识别、短功能基序的发现等。在本文中,我们专注于一类重要模式即基序的识别。我们研究(ℓ, d)基序搜索问题或植入基序搜索(PMS)。PMS的输入是n个字符串以及两个整数ℓ和d。它返回在每个输入字符串中出现的长度为ℓ的所有序列M,其中每次出现与M最多在d个位置上不同。另一种表述是法定人数PMS (qPMS),其中基序出现在至少q%的字符串中。我们引入qPMS9,一种并行精确qPMS算法,它在DNA和蛋白质数据集上显著提高了运行时间。qPMS9解决了具有挑战性的DNA (ℓ, d)实例(28, 12)和(30, 13)。源代码可在https://code.google.com/p/qpms9/获取。