Department of Computer Architecture and Computer Technology, CITIC-UGR, Department of Applied Mathematics, University of Granada, Granada, Spain.
Bioinformatics. 2013 Sep 1;29(17):2112-21. doi: 10.1093/bioinformatics/btt360. Epub 2013 Jun 21.
Multiple sequence alignments (MSAs) are widely used in bioinformatics as a basis for other tasks such as structure prediction, biological function analysis and phylogenetic modeling. However, current tools usually provide only partially optimal alignments, as each one focuses on specific biological features. As a result, the same set of sequences can produce different alignments, especially when the sequences are less similar, and researchers do not agree on the most suitable way to evaluate MSAs. Recent evaluations tend to use more complex scores that include further biological features. Among these, 3D structures are increasingly being used to evaluate alignments. Because structures are more conserved in proteins than sequences, scores that incorporate structural information are better suited to evaluating more distant relationships between sequences.
The proposed multiobjective algorithm, based on the non-dominated sorting genetic algorithm, aims to jointly optimize three objectives: STRIKE score, non-gaps percentage and totally conserved columns. It was assessed on the BAliBASE benchmark, with significance established by the Kruskal-Wallis test (P < 0.01). The algorithm also outperforms other aligners, such as ClustalW, Multiple Sequence Alignment Genetic Algorithm (MSA-GA), PRRP, DIALIGN, Hidden Markov Model Training (HMMT), Pattern-Induced Multi-sequence Alignment (PIMA), MULTIALIGN, Sequence Alignment Genetic Algorithm (SAGA), PILEUP, Rubber Band Technique Genetic Algorithm (RBT-GA) and Vertical Decomposition Genetic Algorithm (VDGA), according to the Wilcoxon signed-rank test (P < 0.05). Its results are not significantly different from those of 3D-COFFEE (P > 0.05), with the advantage that it can work with fewer structures. Structural information is included in the objective function so that the obtained alignments are evaluated more accurately.
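As an illustration only, two of the three objectives named above and the non-dominated (Pareto) comparison underlying this family of genetic algorithms can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the gap symbol `-`, and the exact definitions (non-gap characters as a percentage of all alignment cells; a column counted as totally conserved when it is gap-free and holds a single residue) are assumptions, and the STRIKE score is omitted because it requires 3D structure data.

```python
from typing import List, Sequence, Tuple

GAP = "-"  # assumed gap symbol


def non_gap_percentage(alignment: Sequence[str]) -> float:
    """Percentage of alignment cells that are not gaps (assumed definition)."""
    total = sum(len(row) for row in alignment)
    gaps = sum(row.count(GAP) for row in alignment)
    return 100.0 * (total - gaps) / total


def totally_conserved_columns(alignment: Sequence[str]) -> int:
    """Number of gap-free columns containing one single residue (assumed definition)."""
    count = 0
    for column in zip(*alignment):
        if GAP not in column and len(set(column)) == 1:
            count += 1
    return count


def dominates(a: Tuple[float, ...], b: Tuple[float, ...]) -> bool:
    """True if a is at least as good as b on every objective and strictly
    better on at least one (all objectives are maximized)."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(scores: Sequence[Tuple[float, ...]]) -> List[int]:
    """Indices of the candidates that no other candidate dominates."""
    return [
        i
        for i, s in enumerate(scores)
        if not any(dominates(t, s) for j, t in enumerate(scores) if j != i)
    ]


# Hypothetical toy alignments of the same three sequences.
candidates = [
    ["AC-GT", "ACAGT", "AC-GA"],
    ["ACG-T", "ACAGT", "ACG-A"],
]
scores = [
    (non_gap_percentage(aln), float(totally_conserved_columns(aln)))
    for aln in candidates
]
front = pareto_front(scores)
```

In a full non-dominated sorting scheme, the candidates in `front` would form the first rank, and the procedure would be repeated on the remaining candidates to build the later ranks.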
The source code is available at http://www.ugr.es/~fortuno/MOSAStrE/MO-SAStrE.zip.