Lienau E Kurt, DeSalle Rob, Allard Marc, Brown Eric W, Swofford David, Rosenfeld Jeffrey A, Sarkar Indra N, Planet Paul J
Sackler Institute for Comparative Genomics, American Museum of Natural History, Central Park West at 79th St, New York, NY 10024, USA.
Department of Biology, Graduate School of Arts and Science, New York University, 100 Washington Square East, New York, NY 10003, USA.
Cladistics. 2011 Aug;27(4):417-427. doi: 10.1111/j.1096-0031.2010.00337.x. Epub 2010 Aug 26.
Because horizontal gene transfer can confound the recovery of the largely prokaryotic tree of life (ToL), most genome-based techniques seek to eliminate horizontal signal from ToL analyses, commonly by sieving out incongruent genes and data. This approach greatly limits the number of gene families analysed to a subset thought to be representative of vertical evolutionary history. However, formalized tests have not been performed to determine whether combining the massive amounts of information available in fully sequenced genomes can recover a reasonable ToL. Consequently, we used empirically defined gene homology definitions from a previous study that delineate xenologous gene families (gene families derived from a common transfer event) to generate a massively concatenated, combined-data ToL matrix derived from 323 404 translated open reading frames arranged into 12 381 gene homologue groups coded as amino acid data and 63 336, 64 105, 65 153, 66 922 and 67 109 gene homologue groups coded as gene presence/absence data for 166 fully sequenced genomes. This whole-genome gene presence/absence and amino acid sequence ToL data matrix is composed of 4 867 184 characters (a combined data-type mega-matrix). Phylogenetic analysis of this mega-matrix yielded a fully resolved ToL that classifies all three commonly accepted domains of life as monophyletic and groups most taxa in traditionally recognized locations with high support. Most importantly, these results corroborate the existence of a common evolutionary history for these taxa present in both data types that is evident only when these data are analysed in combination. © The Willi Hennig Society 2010.
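The combined data-type mega-matrix described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function names (`concatenate`, `presence_absence`, `mega_matrix`), the three toy taxa, and the toy alignments are all hypothetical, and the use of `?` for taxa that lack a gene family is an assumed missing-data convention. The sketch only shows the general idea of joining concatenated amino acid characters with one presence/absence character per gene family.

```python
from typing import Dict, List, Set


def concatenate(alignments: List[Dict[str, str]], taxa: List[str]) -> Dict[str, str]:
    """Concatenate aligned gene families into one row per taxon.

    Taxa absent from a family are filled with '?' (assumed missing-data
    symbol) so that every row keeps the same length.
    """
    rows: Dict[str, List[str]] = {t: [] for t in taxa}
    for aln in alignments:
        width = len(next(iter(aln.values())))  # alignment length of this family
        for t in taxa:
            rows[t].append(aln.get(t, "?" * width))
    return {t: "".join(parts) for t, parts in rows.items()}


def presence_absence(families: Dict[str, Set[str]], taxa: List[str]) -> Dict[str, str]:
    """Code each gene family as one character: '1' if the taxon has it, else '0'."""
    return {
        t: "".join("1" if t in members else "0" for members in families.values())
        for t in taxa
    }


def mega_matrix(
    alignments: List[Dict[str, str]],
    families: Dict[str, Set[str]],
    taxa: List[str],
) -> Dict[str, str]:
    """Append the presence/absence characters to the amino acid characters."""
    aa = concatenate(alignments, taxa)
    pa = presence_absence(families, taxa)
    return {t: aa[t] + pa[t] for t in taxa}


# Hypothetical toy data: three taxa, two aligned families, three families
# scored for presence/absence.
taxa = ["A", "B", "C"]
alignments = [
    {"A": "MKV", "B": "MRV"},  # family 1, missing in taxon C
    {"B": "LT", "C": "LS"},    # family 2, missing in taxon A
]
families = {"fam1": {"A", "B"}, "fam2": {"B", "C"}, "fam3": {"C"}}

matrix = mega_matrix(alignments, families, taxa)
for taxon, row in matrix.items():
    print(taxon, row)
```

In this toy case each row holds five amino acid characters followed by three presence/absence characters. A real analysis would pass such a matrix to a phylogenetic program that accepts mixed data types; that step is outside the scope of this sketch.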