Department of Botany, The Field Museum, 1400 South Lake Shore Drive, Chicago, IL 60605-2496, USA.
BMC Bioinformatics. 2011 Jan 7;12:10. doi: 10.1186/1471-2105-12-10.
We present a novel method, 'Pairwise Identity and Cost Scores Ordination' (PICS-Ord), for encoding ambiguously aligned regions in fixed multiple sequence alignments. The method ordinates matrices of sequence identity or cost scores by means of Principal Coordinates Analysis (PCoA). After ambiguous regions are identified, it computes pairwise distances as sequence identities or cost scores, ordinates the resulting distance matrix by PCoA, and encodes the principal coordinates as ordered integers. We assessed the performance of the new method on three biological and 100 simulated datasets.
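The ordination and coding steps described above can be illustrated with a minimal sketch. This is not the authors' implementation, which relies on R and Ngila. It assumes a pairwise distance matrix has already been computed, uses classical PCoA (double-centering followed by eigendecomposition), and bins each axis into equal-width ordered integer states. The function names `pcoa` and `encode_ordered`, the equal-width binning rule, the `n_states` parameter, and the toy distance matrix are all assumptions made for illustration.

```python
import numpy as np


def pcoa(dist, n_axes=2):
    """Classical PCoA of a symmetric distance matrix (illustrative sketch)."""
    d = np.asarray(dist, dtype=float)
    n = d.shape[0]
    # Double-center the squared distances to obtain a Gram matrix.
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (d ** 2) @ centering
    # eigh returns eigenvalues in ascending order; reverse to get the largest first.
    eigvals, eigvecs = np.linalg.eigh(gram)
    order = np.argsort(eigvals)[::-1][:n_axes]
    # Clip small negative eigenvalues caused by rounding or non-Euclidean distances.
    scale = np.sqrt(np.clip(eigvals[order], 0.0, None))
    return eigvecs[:, order] * scale


def encode_ordered(coords, n_states=32):
    """Bin each axis into ordered integer states 0..n_states-1.

    Equal-width binning is an assumption of this sketch, not necessarily
    the coding rule used by PICS-Ord.
    """
    coords = np.asarray(coords, dtype=float)
    low = coords.min(axis=0)
    span = coords.max(axis=0) - low
    span[span == 0] = 1.0  # avoid division by zero on constant axes
    scaled = (coords - low) / span
    return np.minimum((scaled * n_states).astype(int), n_states - 1)


if __name__ == "__main__":
    # Hypothetical distances among three sequences for one ambiguous region.
    toy = np.array([[0.0, 1.0, 3.0],
                    [1.0, 0.0, 2.0],
                    [3.0, 2.0, 0.0]])
    axes = pcoa(toy, n_axes=1)
    print(encode_ordered(axes, n_states=4))
```

Each column of the result is one ordered multistate character and each row is one sequence. The sign of a PCoA axis is arbitrary, so the codes on an axis may come out reversed; for ordered characters this does not change the cost between any two states.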
Compared with excluding ambiguous regions from the analysis a priori, including them as PICS-Ord codes increased topological accuracy, resolution, and bootstrap support in both the biological and the simulated datasets. In terms of accuracy, PICS-Ord performs as well as or better than previously available methods of ambiguous region coding (e.g., INAASE). Its further advantages are a practically unlimited alignment size, increased analytical speed, and the possibility of analyzing PICS-Ord scores together with DNA data in a partitioned maximum likelihood model.
Advantages of PICS-Ord over step matrix-based ambiguous region coding with INAASE include a practically unlimited number of OTUs, seamless integration of PICS-Ord codes into phylogenetic datasets, and faster phylogenetic analysis. In contrast to word- and frequency-based methods, PICS-Ord retains the advantage of deriving distances from pairwise sequence alignment, and the method is flexible with respect to how distance scores are calculated. In addition to distance and maximum parsimony analyses, PICS-Ord codes can be analyzed in a Bayesian or maximum likelihood framework. RAxML (version 7.2.6 or higher, developed for this study) allows ordered or unordered characters with up to 32 states. A GTR, MK, or ORDERED model can be applied to the PICS-Ord codes partition, with GTR performing slightly better than MK and ORDERED.
An implementation of the PICS-Ord algorithm is available from http://scit.us/projects/ngila/wiki/PICS-Ord. It requires both the statistical software R (http://www.r-project.org) and the alignment software Ngila (http://scit.us/projects/ngila).