Keith Jonathan M, Adams Peter, Bryant Darryn, Kroese Dirk P, Mitchelson Keith R, Cochran Duncan A E, Lala Gita H
Department of Mathematics, The University of Queensland, Qld 4072, Australia.
Bioinformatics. 2002 Nov;18(11):1494-9. doi: 10.1093/bioinformatics/18.11.1494.
A consensus sequence for a family of related sequences is, as the name suggests, a sequence that captures the features common to most members of the family. Consensus sequences are important in various DNA sequencing applications and are a convenient way to characterize a family of molecules.
This paper describes a new algorithm for finding a consensus sequence, using the popular optimization method known as simulated annealing. Unlike the conventional approach of finding a consensus sequence by first forming a multiple sequence alignment, this algorithm searches for a sequence that minimises the sum of pairwise distances to each of the input sequences. The resulting consensus sequence can then be used to induce a multiple sequence alignment. The time required by the algorithm scales linearly with the number of input sequences and quadratically with the length of the consensus sequence. We present results demonstrating the high quality of the consensus sequences and alignments produced by the new algorithm. For comparison, we also present similar results obtained using ClustalW. The new algorithm outperforms ClustalW in many cases.
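The search described above can be illustrated with a minimal sketch. The abstract does not specify the distance measure, the move set, or the cooling schedule, so everything concrete here is an assumption made for illustration: Levenshtein edit distance as the pairwise distance, single-character substitutions, insertions and deletions as moves, a geometric cooling schedule, and the hypothetical names `edit_distance`, `total_distance`, `propose` and `anneal_consensus`. It is not the authors' implementation.

```python
import math
import random

ALPHABET = "ACGT"  # assumed DNA alphabet


def edit_distance(a, b):
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # match or substitute
        prev = curr
    return prev[-1]


def total_distance(candidate, sequences):
    """Objective: sum of distances from the candidate to every input."""
    return sum(edit_distance(candidate, s) for s in sequences)


def propose(candidate, rng):
    """Random single-character substitution, insertion or deletion."""
    move = rng.choice(("sub", "ins", "del"))
    if move == "ins" or not candidate:
        pos = rng.randrange(len(candidate) + 1)
        return candidate[:pos] + rng.choice(ALPHABET) + candidate[pos:]
    pos = rng.randrange(len(candidate))
    if move == "del":
        return candidate[:pos] + candidate[pos + 1:]
    return candidate[:pos] + rng.choice(ALPHABET) + candidate[pos + 1:]


def anneal_consensus(sequences, steps=5000, t_start=2.0, t_end=0.01, seed=0):
    """Simulated annealing over candidate consensus sequences."""
    rng = random.Random(seed)
    current = rng.choice(sequences)  # start from one of the inputs
    current_cost = total_distance(current, sequences)
    best, best_cost = current, current_cost
    cooling = (t_end / t_start) ** (1.0 / steps)  # geometric schedule (assumed)
    temperature = t_start
    for _ in range(steps):
        candidate = propose(current, rng)
        cost = total_distance(candidate, sequences)
        delta = cost - current_cost
        # Metropolis rule: always accept improvements, sometimes accept worse.
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            current, current_cost = candidate, cost
            if cost < best_cost:
                best, best_cost = candidate, cost
        temperature *= cooling
    return best, best_cost


if __name__ == "__main__":
    reads = ["ACGTTGCA", "ACGTGCA", "ACCTTGCA", "ACGTTGGA"]
    consensus, cost = anneal_consensus(reads, steps=2000)
    print(consensus, cost)
```

In this sketch each evaluation of the objective runs one dynamic program per input sequence, and each dynamic program fills a table whose size is the product of the two sequence lengths. That is consistent with the scaling stated in the abstract: linear in the number of inputs and, for inputs of length comparable to the consensus, quadratic in the consensus length.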