Department of Computer Science, The University of Hong Kong, Rm 301, Chow Yei Ching Building, Pokfulam Road, Hong Kong.
IEEE/ACM Trans Comput Biol Bioinform. 2012 Nov-Dec;9(6):1629-38. doi: 10.1109/TCBB.2012.104.
Structural alignment has been shown to be an effective computational method to identify structural noncoding RNA(ncRNA) candidates as ncRNAs are known to be conserved in secondary structures. However, the complexity of the structural alignment algorithms becomes higher when the structure has pseudoknots. Even for the simplest type of pseudoknots (simple pseudoknots), the fastest algorithm runs in O(mn3) time, where m, n are the length of the query ncRNA (with known structure) and the length of the target sequence (with unknown structure), respectively. In practice, we are usually given a long DNA sequence and we try to locate regions in the sequence for possible candidates of a particular ncRNA. Thus, we need to run the structural alignment algorithm on every possible region in the long sequence. For example, finding candidates for a known ncRNA of length 100 on a sequence of length 50,000, it takes more than one day. In this paper, we provide an efficient algorithm to solve the problem for simple pseudoknots and it is shown to be 10 times faster. The speedup stems from an effective pruning strategy consisting of the computation of a lower bound score for the optimal alignment and an estimation of the maximum score that a candidate can achieve to decide whether to prune the current candidate or not.
结构比对已被证明是一种有效的计算方法,可以识别结构非编码 RNA(ncRNA)候选物,因为已知 ncRNA 在二级结构中是保守的。然而,当结构具有假结时,结构比对算法的复杂性会更高。即使是最简单类型的假结(简单假结),最快的算法也需要在 O(mn3)时间内运行,其中 m、n 分别是查询 ncRNA(具有已知结构)和目标序列(具有未知结构)的长度。在实践中,我们通常会得到一个长的 DNA 序列,并尝试在序列中找到特定 ncRNA 的可能候选区域。因此,我们需要在长序列的每个可能区域上运行结构比对算法。例如,在长度为 50000 的序列上查找已知长度为 100 的 ncRNA 的候选物,需要一天多的时间。在本文中,我们提供了一种针对简单假结的有效算法,其速度提高了 10 倍。加速源于一种有效的剪枝策略,该策略包括计算最优比对的下界得分和估计候选物可以达到的最大得分,以决定是否剪枝当前候选物。