Xiao Peng, Cai Xingyu, Rajasekaran Sanguthevar
IEEE/ACM Trans Comput Biol Bioinform. 2021 Jan-Feb;18(1):27-37. doi: 10.1109/TCBB.2020.3024222. Epub 2021 Feb 3.
Discovering patterns in biological sequences is a crucial step to extract useful information from them. Motifs can be viewed as patterns that occur exactly or with minor changes across some or all of the biological sequences. Motif search has numerous applications including the identification of transcription factors and their binding sites, composite regulatory patterns, similarity among families of proteins, etc. The general problem of motif search is intractable. One of the most studied models of motif search proposed in literature is Edit-distance based Motif Search (EMS). In EMS, the goal is to find all the patterns of length l that occur with an edit-distance of at most d in each of the input sequences. EMS algorithms existing in the literature do not scale well on challenging instances and large datasets. In this paper, the current state-of-the-art EMS solver is advanced by exploiting the idea of dimension reduction. A novel idea to reduce the cardinality of the alphabet is proposed. The algorithm we propose, EMS3, is an exact algorithm. I.e., it finds all the motifs present in the input sequences. EMS3 can be also viewed as a divide and conquer algorithm. In this paper, we provide theoretical analyses to establish the efficiency of EMS3. Extensive experiments on standard benchmark datasets (synthetic and real-world) show that the proposed algorithm outperforms the existing state-of-the-art algorithm (EMS2).
在生物序列中发现模式是从这些序列中提取有用信息的关键步骤。基序可以被视为在部分或所有生物序列中精确出现或略有变化的模式。基序搜索有许多应用,包括转录因子及其结合位点的识别、复合调控模式、蛋白质家族间的相似性等。基序搜索的一般问题是难以处理的。文献中提出的最受研究的基序搜索模型之一是基于编辑距离的基序搜索(EMS)。在EMS中,目标是找到在每个输入序列中编辑距离至多为d出现的所有长度为l的模式。文献中现有的EMS算法在具有挑战性的实例和大型数据集上扩展性不佳。在本文中,通过利用降维思想改进了当前最先进的EMS求解器。提出了一种降低字母表基数的新颖想法。我们提出的算法EMS3是一种精确算法。即,它能找到输入序列中存在的所有基序。EMS3也可以被视为一种分治算法。在本文中,我们提供理论分析以确定EMS3的效率。在标准基准数据集(合成和真实世界)上进行的大量实验表明,所提出的算法优于现有的最先进算法(EMS2)。