Ayad Lorraine A K, Pissis Solon P
Department of Informatics, King's College London, Strand, London, WC2R 2LS, UK.
BMC Genomics. 2017 Jan 14;18(1):86. doi: 10.1186/s12864-016-3477-5.
A fundamental assumption of all widely used multiple sequence alignment techniques is that the left- and right-most positions of the input sequences are relevant to the alignment. However, the position where a sequence starts or ends can be entirely arbitrary for a number of reasons: arbitrariness in the linearisation (sequencing) of a circular molecular structure, or inconsistencies introduced into sequence databases by different linearisation standards. These scenarios arise, for instance, in the multiple sequence alignment of mitochondrial DNA, viroid, viral or other genomes that have a circular molecular structure. One solution to these inconsistencies is to identify a suitable rotation (cyclic shift) for each sequence; the refined sequences may in turn lead to improved multiple sequence alignments with the user's preferred multiple sequence alignment program.
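To make the notion of a rotation (cyclic shift) concrete, the following minimal Python sketch shows how two linearisations of the same circular molecule differ only in their starting position. The function names `rotate` and `is_rotation` and the toy sequences are illustrative assumptions and are not taken from the MARS software.

```python
def rotate(seq: str, k: int) -> str:
    """Return seq cyclically shifted to the left by k positions."""
    if not seq:
        return seq
    k %= len(seq)
    return seq[k:] + seq[:k]


def is_rotation(a: str, b: str) -> bool:
    """Return True if b is some cyclic shift of a.

    Every rotation of a occurs as a substring of a + a, so two
    linearisations of the same circular sequence pass this check
    even though they start at different positions.
    """
    return len(a) == len(b) and b in a + a


if __name__ == "__main__":
    genome = "ACGTTGCA"              # toy circular sequence, cut at position 0
    other_cut = rotate(genome, 3)    # the same circle, cut at position 3
    print(other_cut)
    print(is_rotation(genome, other_cut))
```

A conventional aligner treats `genome` and `other_cut` as different strings and has to introduce gaps or mismatches at the ends, even though they describe the same molecule.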
We present MARS, a new heuristic method for improving Multiple circular sequence Alignment using Refined Sequences. MARS is implemented in C++ as a program that computes the rotations (cyclic shifts) required to best align a set of input sequences. Experimental results on real and synthetic data show that MARS improves the alignments with respect to standard genetic measures and the inferred maximum-likelihood-based phylogenies, and that it outperforms state-of-the-art methods in terms of both accuracy and efficiency. Among other things, our results show that the average pairwise distance in the multiple sequence alignment of a dataset of widely studied mitochondrial DNA sequences is reduced by around 5% when MARS is applied before a multiple sequence alignment is performed.
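As a rough illustration of what "computing a rotation" means, the sketch below searches exhaustively for the cyclic shift of one sequence that minimises the Hamming distance to a reference. This is a naive quadratic-time baseline and not the MARS heuristic, which is not described in this abstract. It assumes equal-length sequences that differ by substitutions only, and the names `hamming` and `best_rotation` are hypothetical.

```python
def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def best_rotation(reference: str, seq: str) -> tuple[int, int]:
    """Return (k, distance) for the left rotation of seq closest to reference.

    Tries every rotation, so the cost is O(n^2) for sequences of length n.
    Ties are resolved in favour of the smallest k.
    """
    if len(reference) != len(seq):
        raise ValueError("this sketch assumes equal-length sequences")
    n = len(seq)
    doubled = seq + seq  # doubled[k:k + n] is seq rotated left by k
    best_k, best_d = 0, hamming(reference, seq)
    for k in range(1, n):
        d = hamming(reference, doubled[k:k + n])
        if d < best_d:
            best_k, best_d = k, d
    return best_k, best_d


if __name__ == "__main__":
    reference = "ACGTTGCA"
    shifted = "GCAACGAT"  # reference cut elsewhere, with one substitution
    k, d = best_rotation(reference, shifted)
    print(k, d)
    print((shifted + shifted)[k:k + len(shifted)])  # the refined sequence
```

Real genomes differ in length and contain insertions and deletions, so a practical tool needs an edit-distance-like measure and a faster search than this exhaustive scan. Once each sequence has been rotated, the refined sequences can be passed to any standard multiple sequence alignment program.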
Analysing multiple sequences simultaneously is fundamental in biological research, and multiple sequence alignment is a popular method for this task. Conventional alignment techniques cannot be used effectively when the position at which sequences start is arbitrary. We present a method that can be used in conjunction with any multiple sequence alignment program to address this problem effectively and efficiently.