Yamada Kazunori D, Tomii Kentaro, Katoh Kazutaka
Graduate School of Information Sciences, Tohoku University, Sendai 980-8579, Japan; Artificial Intelligence Research Center, National Institute of Advanced Industrial Science and Technology (AIST), Tokyo 135-0064, Japan.
Artificial Intelligence Research Center, National Institute of Advanced Industrial Science and Technology (AIST), Tokyo 135-0064, Japan; Biotechnology Research Institute for Drug Discovery, National Institute of Advanced Industrial Science and Technology (AIST), Tokyo 135-0064, Japan.
Bioinformatics. 2016 Nov 1;32(21):3246-3251. doi: 10.1093/bioinformatics/btw412. Epub 2016 Jul 4.
Large multiple sequence alignments (MSAs), consisting of thousands of sequences, are becoming increasingly common owing to advances in sequencing technologies. The MAFFT MSA program has several options for building large MSAs, but their performance has not yet been sufficiently assessed because realistic benchmarking of large MSAs has been difficult. Recently, such assessments have become possible through the HomFam and ContTest benchmark protein datasets. Along with the development of these datasets, an interesting theory was proposed: chained guide trees increase the accuracy of MSAs of structurally conserved regions. This theory challenges the basis of progressive alignment methods and needs to be examined by comparison with other known methods, including computationally intensive ones.
We used HomFam, ContTest and OXFam (an extended version of OXBench) to evaluate several methods implemented in MAFFT: (1) a progressive method with approximate guide trees, (2) a progressive method with chained guide trees, (3) a combination of an iterative refinement method and a progressive method and (4) a less approximate progressive method that uses a rigorous guide tree and a consistency score. Two other programs available for large MSAs, Clustal Omega and UPP, were also included in the comparison. The effect of method 2 (chained guide trees) was positive in ContTest but negative in HomFam and OXFam. Methods 3 and 4 increased the benchmark scores more consistently than method 2 across the three datasets, suggesting that they are safer to use.
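To make the term "chained guide tree" concrete, the following minimal Python sketch builds a fully imbalanced (chained) tree and, for contrast, a balanced tree over the same sequence names, both as Newick strings. It is an illustration only and not MAFFT's implementation: the function names `chained_newick`, `balanced_newick` and `nesting_depth` are hypothetical, and chaining the sequences in input order is an assumption made for simplicity.

```python
from typing import Sequence


def chained_newick(names: Sequence[str]) -> str:
    """Return a fully imbalanced ("chained") guide tree in Newick format.

    Sequences are joined one at a time, in the order given (an assumption
    of this sketch), so every internal node has one leaf child.
    """
    if len(names) < 2:
        raise ValueError("need at least two sequences")
    tree = f"({names[0]},{names[1]})"
    for name in names[2:]:
        tree = f"({tree},{name})"
    return tree + ";"


def balanced_newick(names: Sequence[str]) -> str:
    """Return a balanced tree over the same leaves, for contrast."""
    if len(names) < 2:
        raise ValueError("need at least two sequences")

    def build(block: Sequence[str]) -> str:
        if len(block) == 1:
            return block[0]
        mid = len(block) // 2
        return f"({build(block[:mid])},{build(block[mid:])})"

    return build(list(names)) + ";"


def nesting_depth(newick: str) -> int:
    """Maximum parenthesis nesting depth of a Newick string."""
    depth = deepest = 0
    for ch in newick:
        if ch == "(":
            depth += 1
            deepest = max(deepest, depth)
        elif ch == ")":
            depth -= 1
    return deepest


if __name__ == "__main__":
    seqs = [f"s{i}" for i in range(1, 9)]
    print(chained_newick(seqs), nesting_depth(chained_newick(seqs)))
    print(balanced_newick(seqs), nesting_depth(balanced_newick(seqs)))
```

In a chained tree over n sequences the nesting depth is n - 1, so a progressive aligner following it adds one sequence per step to a single growing alignment, whereas a balanced tree merges sub-alignments of similar size.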
http://mafft.cbrc.jp/alignment/software/
CONTACT: katoh@ifrec.osaka-u.ac.jp
Supplementary information: Supplementary data are available at Bioinformatics online.