Dang Cuong Cao, Le Vinh Sy, Gascuel Olivier, Hazes Bart, Le Quang Si
The Wellcome Trust Center for Human Genetics, Oxford University, Oxford, UK.
BMC Bioinformatics. 2014 Oct 24;15(1):341. doi: 10.1186/1471-2105-15-341.
Amino acid replacement rate matrices are a crucial component of many protein analysis systems such as sequence similarity search, sequence alignment, and phylogenetic inference. Ideally, the rate matrix reflects the mutational behavior of the actual data under study; however, estimating amino acid replacement rate matrices requires large protein alignments and is computationally expensive and complex. As a compromise, sub-optimal pre-calculated generic matrices are typically used for protein-based phylogeny. Sequence availability has now grown to a point where problem-specific rate matrices can often be calculated if the computational cost can be controlled.
The most time-consuming step in estimating rate matrices by maximum likelihood is building maximum likelihood phylogenetic trees from protein alignments. We propose a new procedure, called FastMG, to overcome this obstacle. The key innovation is an alignment-splitting algorithm that divides alignments with many sequences into non-overlapping sub-alignments before the amino acid replacement rates are estimated. Experiments with several large data sets showed that the FastMG procedure was an order of magnitude faster than estimation without splitting. Importantly, there was no apparent loss in matrix quality when an appropriate splitting procedure was used.
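The splitting step can be illustrated with a minimal sketch. The abstract does not say how sequences are assigned to sub-alignments, so the random partition below is only an assumption chosen for simplicity and is not presented as the FastMG splitting procedure. The function name `split_alignment` and the parameters `max_size` and `seed` are hypothetical.

```python
import random


def split_alignment(alignment, max_size, seed=0):
    """Partition an alignment into non-overlapping sub-alignments.

    alignment: dict mapping sequence name -> aligned sequence string.
    max_size:  largest number of sequences allowed per sub-alignment
               (hypothetical parameter, not taken from the paper).
    Assumption: sequences are assigned at random; the abstract does not
    describe the actual splitting rule.
    """
    names = sorted(alignment)
    if len(names) <= max_size:
        # Small alignments are left whole.
        return [dict(alignment)]

    rng = random.Random(seed)
    rng.shuffle(names)

    # Ceiling division: fewest parts that respect max_size.
    n_parts = -(-len(names) // max_size)

    # Deal names round-robin so part sizes differ by at most one.
    parts = [names[i::n_parts] for i in range(n_parts)]
    return [{name: alignment[name] for name in part} for part in parts]


# Toy alignment of 10 identical-length sequences.
toy = {f"seq{i}": "MKV-LA" for i in range(10)}
subs = split_alignment(toy, max_size=4)
sizes = [len(sub) for sub in subs]
```

Each sequence lands in exactly one sub-alignment, so a tree can be built for each sub-alignment independently. This is where the time saving would come from, since tree building is the dominant cost.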
FastMG is a simple, fast and accurate procedure to estimate amino acid replacement rate matrices from large data sets. It enables researchers to study the evolutionary relationships for specific groups of proteins or taxa with optimized, data-specific amino acid replacement rate matrices. The programs, data sets, and the new mammalian mitochondrial protein rate matrix are available at http://fastmg.codeplex.com.