Zhang Shijie, Su Wei, Yang Jiong
Department of Electrical Engineering and Computer Science, Case Western Reserve University, Cleveland, OH 44106, USA.
Bioinformatics. 2009 Jan 15;25(2):183-9. doi: 10.1093/bioinformatics/btn609. Epub 2008 Dec 9.
The goal of motif discovery is to detect novel, unknown, and important signals in biological sequences. In most models, the importance of a motif equals the sum of the similarity scores of its individual positions. In 2006, Song et al. introduced the Aggregated Related Column Score (ARCS) measure, which incorporates correlation information into the evaluation of motif importance. That paper showed that the ARCS measure is superior to other measures. Because of the complicated nature of the ARCS motif model, existing sequential motif discovery methods cannot be applied directly to find motifs with high ARCS values.
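To make the position-wise additive model concrete, the following is a minimal Python sketch. It is not the ARCS formula, which the abstract does not give. The function names and the choice of similarity (the fraction of residues matching the most common residue in a column) are illustrative assumptions.

```python
from collections import Counter


def column_similarity(column):
    """Fraction of residues in a column that match its most common residue.

    Assumed similarity measure, chosen only for illustration.
    """
    counts = Counter(column)
    return counts.most_common(1)[0][1] / len(column)


def additive_motif_score(occurrences):
    """Sum the per-column similarity over aligned, equal-length occurrences."""
    if not occurrences:
        raise ValueError("need at least one occurrence")
    width = len(occurrences[0])
    if any(len(o) != width for o in occurrences):
        raise ValueError("occurrences must have equal length")
    # zip(*occurrences) yields one tuple of residues per motif position.
    return sum(column_similarity(col) for col in zip(*occurrences))


# Columns in `correlated` vary together (A pairs with C, G pairs with T);
# in `shuffled` the same residues are paired independently.
correlated = ["AC", "AC", "GT", "GT"]
shuffled = ["AC", "AT", "GC", "GT"]
print(additive_motif_score(correlated), additive_motif_score(shuffled))
```

Both alignments receive the same score because each column is evaluated in isolation. Any dependence between columns is invisible to a purely additive score, which is the kind of information a correlation-aware measure is meant to capture.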
This article presents a novel mining algorithm, ARCS-Motif, to discover related sequential motifs in biological sequences. ARCS-Motif is applied to 400 PROSITE datasets and compared with five alternative methods (CONSENSUS, Gibbs sampler, MEME, SPLASH and DIALIGN-TX). ARCS-Motif outperforms all of these methods in accuracy and most of them in efficiency. Although SPLASH is more efficient than ARCS-Motif, ARCS-Motif is much more accurate than SPLASH. On average, ARCS-Motif produces motifs that are at least 10% better than those of the best alternative method. Among the 400 PROSITE datasets, ARCS-Motif produces the best motifs for more than 200 families. Excluding SPLASH, the execution time of ARCS-Motif is less than a third of that of the fastest alternative method, and among all methods its execution time grows at the slowest rate with respect to the number of sequences and the average sequence length.