Regier Jerome C, Shultz Jeffrey W, Ganley Austen R D, Hussey April, Shi Diane, Ball Bernard, Zwick Andreas, Stajich Jason E, Cummings Michael P, Martin Joel W, Cunningham Clifford W
Center for Biosystems Research, University of Maryland Biotechnology Institute, College Park, Maryland 20742, USA.
Syst Biol. 2008 Dec;57(6):920-38. doi: 10.1080/10635150802570791.
This study attempts to resolve relationships among and within the four basal arthropod lineages (Pancrustacea, Myriapoda, Euchelicerata, Pycnogonida) and to assess the widespread expectation that remaining phylogenetic problems will yield to increasing amounts of sequence data. Sixty-eight regions of 62 protein-coding nuclear genes (approximately 41 kilobases (kb)/taxon) were sequenced for 12 taxonomically diverse arthropod taxa and a tardigrade outgroup.
Parsimony, likelihood, and Bayesian analyses of total nucleotide data generally strongly supported the monophyly of each of the basal lineages represented by more than one species. Other relationships within the Arthropoda were also supported, with support levels depending on method of analysis and inclusion/exclusion of synonymous changes. Removing third codon positions, where the assumption of base compositional homogeneity was rejected, altered the results. Removing the final class of synonymous mutations (first codon positions encoding leucine and arginine, which were also compositionally heterogeneous) yielded a data set that was consistent with a hypothesis of base compositional homogeneity. Furthermore, under such a data-exclusion regime, all 68 gene regions individually were consistent with base compositional homogeneity. Restricting likelihood analyses to nonsynonymous change recovered trees with strong support for the basal lineages but not for other groups that were variably supported with more inclusive data sets.
In a further effort to increase phylogenetic signal, three types of data exploration were undertaken.
(1) Individual genes were ranked by their average rate of nonsynonymous change, and three rate categories were assigned: fast, intermediate, and slow. Then, bootstrap analysis of each gene was performed separately to see which taxonomic groups received strong support. Five taxonomic groups were strongly supported independently by two or more genes, and these genes mostly belonged to the slow or intermediate categories, whereas groups supported by only a single gene region tended to be supported by genes of the fast category, arguing that fast genes provide a less consistent signal.
(2) A sensitivity analysis was performed in which increasing numbers of genes were excluded, beginning with the fastest. The number of strongly supported nodes increased up to a point and then decreased slightly. Recovery of Hexapoda required removal of fast genes. Support for Mandibulata (Pancrustacea + Myriapoda) also increased, at times to "strong" levels, with removal of the fastest genes.
(3) Concordance selection was evaluated by clustering genes according to their ability to recover Pancrustacea, Euchelicerata, or Myriapoda and analyzing the three clusters separately. All clusters of genes recovered the three concordance clades but were at times inconsistent in the relationships recovered among and within these clades, a result that indicates that the a priori concordance criteria may bias phylogenetic signal in unexpected ways.
In a further attempt to increase support of taxonomic relationships, sequence data from 49 additional taxa for three slow genes (i.e., EF-1 alpha, EF-2, and Pol II) were combined with the various 13-taxon data sets. The 62-taxon analyses supported the results of the 13-taxon analyses and provided increased support for additional pancrustacean clades found in an earlier analysis including only EF-1 alpha, EF-2, and Pol II.
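The data-exclusion and rate-ranking steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' pipeline. The function names, the equal-thirds rule for assigning rate categories, and the `step` parameter are assumptions made for illustration, since the abstract does not say how category boundaries or exclusion increments were chosen. The sketch shows (a) dropping third codon positions from an in-frame coding sequence, (b) ranking genes by a supplied per-gene rate of nonsynonymous change and labelling them fast, intermediate, or slow, and (c) building the series of gene sets obtained by excluding progressively more of the fastest genes.

```python
from typing import Dict, List


def drop_third_positions(seq: str) -> str:
    """Return an in-frame coding sequence with every third codon position removed."""
    return "".join(base for i, base in enumerate(seq) if i % 3 != 2)


def rank_by_rate(rates: Dict[str, float]) -> List[str]:
    """Return gene names ordered from the fastest to the slowest rate."""
    return sorted(rates, key=lambda gene: rates[gene], reverse=True)


def rate_categories(rates: Dict[str, float]) -> Dict[str, str]:
    """Label each gene fast, intermediate, or slow.

    Assumption: categories are equal thirds of the ranked list; the
    abstract does not state how the category boundaries were drawn.
    """
    labels = ("fast", "intermediate", "slow")
    ranked = rank_by_rate(rates)
    n = len(ranked)
    return {gene: labels[i * 3 // n] for i, gene in enumerate(ranked)}


def exclusion_series(rates: Dict[str, float], step: int = 1) -> List[List[str]]:
    """Return gene sets with 0, step, 2*step, ... of the fastest genes excluded.

    Each set would then be analyzed separately (for example by bootstrap)
    to count strongly supported nodes; that step is outside this sketch.
    """
    ranked = rank_by_rate(rates)
    return [ranked[k:] for k in range(0, len(ranked), step)]
```

In use, `rates` would map each gene region to its estimated average rate of nonsynonymous change, and each list returned by `exclusion_series` would define one concatenated data set for a separate phylogenetic analysis.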