Al-Turaiki Isra, Badr Ghada, Mathkour Hassan
Int J Data Min Bioinform. 2015;13(1):13-30. doi: 10.1504/ijdmb.2015.070833.
Motif discovery is the problem of finding recurring patterns in biological sequences, and it is one of the hardest and longest-standing problems in bioinformatics. Apriori is a well-known data-mining algorithm for discovering frequent patterns in large datasets. In this paper, we apply the Apriori algorithm, together with the Trie data structure, to motif discovery, and we propose several modifications that adapt the classic Apriori to this problem. We evaluate the proposed algorithm, the Trie-based Apriori Motif Discovery (TrieAMD), on Tompa's benchmark. On real datasets, TrieAMD outperforms all of the tested tools on the average sensitivity measure, which means that it is able to discover more motifs. In terms of specificity, its performance is comparable to that of the other tools. The results also confirm that the algorithm scales linearly in both time and space.
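As a rough illustration of the general idea of combining Apriori-style level-wise growth with a trie, the following minimal sketch mines exact substrings that occur in at least a given number of sequences. It is not the TrieAMD algorithm: the abstract does not describe the authors' modifications, so the names `TrieNode`, `mine_frequent_patterns`, `min_support` and `max_len`, the exact-match support definition, and the toy DNA sequences are all assumptions made for this example.

```python
from typing import Dict, List, Set, Tuple


class TrieNode:
    """One trie node; the path from the root spells a frequent pattern."""

    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        # Indices of the input sequences that contain this node's pattern.
        self.support: Set[int] = set()


def mine_frequent_patterns(
    sequences: List[str],
    min_support: int,
    max_len: int,
    alphabet: str = "ACGT",
) -> Tuple[TrieNode, Dict[str, int]]:
    """Level-wise (Apriori-style) mining of exact substrings (hypothetical helper).

    A pattern is kept if it occurs in at least `min_support` sequences.
    Returns the trie root and a dict mapping each frequent pattern to its support.
    """
    root = TrieNode()
    root.support = set(range(len(sequences)))
    frequent: Dict[str, int] = {}
    frontier: List[Tuple[str, TrieNode]] = [("", root)]

    for length in range(1, max_len + 1):
        next_frontier: List[Tuple[str, TrieNode]] = []
        for pattern, node in frontier:
            for symbol in alphabet:
                candidate = pattern + symbol
                # Apriori pruning: the prefix is frequent by construction,
                # so only the (length - 1)-suffix needs to be checked.
                if length > 1 and candidate[1:] not in frequent:
                    continue
                # Only sequences that contain the prefix can contain the candidate.
                support = {i for i in node.support if candidate in sequences[i]}
                if len(support) >= min_support:
                    child = TrieNode()
                    child.support = support
                    node.children[symbol] = child
                    frequent[candidate] = len(support)
                    next_frontier.append((candidate, child))
        if not next_frontier:
            break
        frontier = next_frontier

    return root, frequent


# Toy input, invented for illustration only.
toy_sequences = ["ACGTAC", "TTACGA", "GGACGT"]
trie_root, frequent_patterns = mine_frequent_patterns(
    toy_sequences, min_support=3, max_len=3
)
print(frequent_patterns)
```

Real motif finders typically have to tolerate mismatches and score candidates statistically, which this exact-match sketch does not attempt.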