Application of the MAFFT sequence alignment program to large data: reexamination of the usefulness of chained guide trees.

Author information

Yamada Kazunori D, Tomii Kentaro, Katoh Kazutaka

Affiliations

Graduate School of Information Sciences, Tohoku University, Sendai 980-8579, Japan; Artificial Intelligence Research Center, National Institute of Advanced Industrial Science and Technology (AIST), Tokyo 135-0064, Japan.

Artificial Intelligence Research Center, National Institute of Advanced Industrial Science and Technology (AIST), Tokyo 135-0064, Japan; Biotechnology Research Institute for Drug Discovery, National Institute of Advanced Industrial Science and Technology (AIST), Tokyo 135-0064, Japan.

Publication information

Bioinformatics. 2016 Nov 1;32(21):3246-3251. doi: 10.1093/bioinformatics/btw412. Epub 2016 Jul 4.

DOI:10.1093/bioinformatics/btw412

PMID:27378296

Full text link: https://pmc.ncbi.nlm.nih.gov/articles/PMC5079479/

Abstract

MOTIVATION

Large multiple sequence alignments (MSAs), consisting of thousands of sequences, are becoming more and more common, due to advances in sequencing technologies. The MAFFT MSA program has several options for building large MSAs, but their performances have not been sufficiently assessed yet, because realistic benchmarking of large MSAs has been difficult. Recently, such assessments have been made possible through the HomFam and ContTest benchmark protein datasets. Along with the development of these datasets, an interesting theory was proposed: chained guide trees increase the accuracy of MSAs of structurally conserved regions. This theory challenges the basis of progressive alignment methods and needs to be examined by being compared with other known methods including computationally intensive ones.

RESULTS

We used HomFam, ContTest and OXFam (an extended version of OXBench) to evaluate several methods available in MAFFT: (1) a progressive method with approximate guide trees, (2) a progressive method with chained guide trees, (3) a combination of an iterative refinement method and a progressive method and (4) a less approximate progressive method that uses a rigorous guide tree and consistency score. Two other programs available for large MSAs, Clustal Omega and UPP, were also included in the comparison. The effect of method 2 (chained guide trees) was positive in ContTest but negative in HomFam and OXFam. Methods 3 and 4 increased the benchmark scores more consistently than method 2 for the three datasets, suggesting that they are safer to use.
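To make the term "chained guide tree" in method 2 concrete, the following is a minimal Python sketch, not MAFFT's implementation. It assumes that a chained tree means a fully imbalanced (caterpillar) topology in which sequences are added one at a time, optionally in random order. The function names `chained_guide_tree` and `tree_depth` are hypothetical, and the output is plain Newick for illustration only; it is not claimed to be an input format that MAFFT accepts directly.

```python
import random


def chained_guide_tree(names, seed=None):
    """Return a Newick string for a fully imbalanced (chained) tree.

    Assumption for illustration: sequences join the growing alignment
    one at a time, so every internal node has one leaf child.
    """
    if len(names) < 2:
        raise ValueError("need at least two sequence names")
    order = list(names)
    if seed is not None:
        # Optional random ordering of the sequences along the chain.
        random.Random(seed).shuffle(order)
    tree = f"({order[0]},{order[1]})"
    for name in order[2:]:
        # Each step wraps the current subtree together with one new leaf.
        tree = f"({tree},{name})"
    return tree + ";"


def tree_depth(newick):
    """Maximum nesting depth of parentheses in a Newick string."""
    depth = deepest = 0
    for ch in newick:
        if ch == "(":
            depth += 1
            deepest = max(deepest, depth)
        elif ch == ")":
            depth -= 1
    return deepest


if __name__ == "__main__":
    print(chained_guide_tree(["s1", "s2", "s3", "s4"]))  # (((s1,s2),s3),s4);
```

For N sequences the chain has depth N - 1, whereas a balanced tree over the same sequences would have depth of roughly log2(N). That difference in shape is what the comparison between methods 1 and 2 varies.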

AVAILABILITY AND IMPLEMENTATION

http://mafft.cbrc.jp/alignment/software/

CONTACT: katoh@ifrec.osaka-u.ac.jp

SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
