Gupta Radhey S, Sneath Peter H A
Department of Biochemistry and Biomedical Sciences, McMaster University, Hamilton, Canada L8N 3Z5.
J Mol Evol. 2007 Jan;64(1):90-100. doi: 10.1007/s00239-006-0082-2. Epub 2006 Dec 9.
The character compatibility approach, which removes all homoplasic characters and involves finding the largest clique of compatible characters in a dataset, in principle provides a powerful means of obtaining the correct topology in difficult-to-resolve cases. However, the usefulness of this approach for phylogeny determination from generalized molecular sequence data has not previously been studied. We have used this approach to determine the topology of 23 proteobacterial species (6 each of alpha-, beta- and gamma-, 3 delta-, and 2 epsilon-proteobacteria) using sequence data for 10 conserved proteins (Hsp60, Hsp70, EF-Tu, EF-G, alanyl-tRNA synthetase, RecA, GyrA, GyrB, RpoB and RpoC). All sites in the sequence alignments of these proteins where only two amino acids were found, with each amino acid present in at least two species, were selected. Mutual compatibility of these binary-state sites was determined in two ways. In the first, all of these sites were combined into a large dataset (Set A; 957 characters) prior to compatibility analysis. In the second, compatibility analysis was carried out on characters from individual proteins, and all compatible sites were then combined into a large dataset (Set B; 398 characters) for further study. Upon compatibility analysis, the largest cliques obtained from Sets A and B consisted of 337 and 323 compatible characters, respectively. In these cliques, all proteobacterial subgroups were clearly distinguished and the branching orders of most of the species were also resolved. The epsilon-proteobacteria exhibited the earliest branching, whereas the beta- and gamma-subgroups were found to have emerged last. The relative placement of the alpha- and delta-subgroups, however, was not resolved. The topology of these species was also determined from 16S rRNA sequences and from a concatenated dataset of sequences for all 10 proteins by means of neighbor-joining, maximum likelihood, and maximum parsimony methods.
In the protein trees, all proteobacterial groups were reliably resolved and they branched in the following order: (epsilon(delta(alpha(beta,gamma)))). However, in the rRNA trees, the gamma- and beta-subgroups exhibited polyphyletic branching and many internal nodes were not resolved. These results indicate that character compatibility analysis using generalized molecular sequence data provides a powerful means for evolutionary studies. Based on molecular sequences, it should be possible to obtain very large datasets of compatible characters that should prove very helpful in clarifying difficult-to-resolve phylogenetic relationships.
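The site selection and clique search described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' software, which the abstract does not name. It assumes an ungapped alignment stored as a dictionary of equal-length strings, uses the standard four-gamete test for pairwise compatibility of binary characters, and finds a largest clique with a Bron-Kerbosch search with pivoting. The function names (`binary_sites`, `compatible`, `largest_clique`) and the toy alignment are hypothetical and chosen only for illustration.

```python
def binary_sites(alignment):
    """Select columns with exactly two residues, each present in >= 2 sequences.

    `alignment` maps a taxon name to an aligned sequence (assumed ungapped and
    of equal length). Returns {column index: tuple of 0/1 states per taxon}.
    """
    seqs = list(alignment.values())
    sites = {}
    for col in range(len(seqs[0])):
        column = [s[col] for s in seqs]
        states = set(column)
        if len(states) == 2 and all(column.count(a) >= 2 for a in states):
            ref = column[0]  # arbitrary 0/1 coding; the test below is symmetric
            sites[col] = tuple(int(c != ref) for c in column)
    return sites


def compatible(a, b):
    """Four-gamete test: two binary characters conflict only if all four
    state combinations (0,0), (0,1), (1,0), (1,1) occur among the taxa."""
    return len(set(zip(a, b))) < 4


def largest_clique(sites):
    """Return the sorted column indices of a largest set of mutually
    compatible sites (Bron-Kerbosch with pivoting; exponential worst case)."""
    keys = list(sites)
    adj = {k: {m for m in keys if m != k and compatible(sites[k], sites[m])}
           for k in keys}
    best = set()

    def expand(r, p, x):
        nonlocal best
        if not p and not x:
            if len(r) > len(best):
                best = set(r)
            return
        pivot = max(p | x, key=lambda v: len(adj[v] & p))
        for v in list(p - adj[pivot]):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(keys), set())
    return sorted(best)


# Toy alignment (hypothetical data): columns 0-2 are binary sites,
# column 3 is excluded because 'R' occurs in only one sequence.
toy = {
    "t1": "ALSK",
    "t2": "ALTK",
    "t3": "GLSK",
    "t4": "GVTK",
    "t5": "GVSR",
}
sites = binary_sites(toy)
print(largest_clique(sites))  # → [0, 1]
```

In the toy data, columns 0 and 1 are compatible with each other, while column 2 shows all four state combinations against each of them and is therefore left out of the clique. For binary characters, pairwise compatibility implies that the whole set can be placed on a single tree, which is why a clique in the compatibility graph is the object of interest. Datasets on the scale of Set A (957 characters) may need a more specialized clique solver than this sketch.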