Detecting phylogenetic signals in eukaryotic whole genome sequences.

Author information

Cohen Eyal, Chor Benny

Affiliation

School of Computer Science, Tel-Aviv University, Israel.

Publication information

J Comput Biol. 2012 Aug;19(8):945-56. doi: 10.1089/cmb.2012.0122.

DOI:10.1089/cmb.2012.0122

PMID:22876786

Abstract

Whole genome sequences are a rich source of molecular data, with a potential for the discovery of novel evolutionary information. Yet, many parts of these sequences are not known to be under evolutionary pressure and, thus, are not conserved. Furthermore, a good model for whole genome evolution does not exist. Consequently, it is not a priori clear if a meaningful phylogenetic signal exists and can be extracted from the sequences as a whole. Indeed, very few phylogenies were reconstructed based on these sequences. Prior to this work, only two reconstruction methods were applied to large eukaryotic genomes: the K(r) method (Haubold et al., 2009), which was applied to genomes of rather small diversity (Drosophila species), and the feature frequency profile method (Sims et al., 2009a), which was applied to genomes of moderate diversity (mammals). We investigate the whole genome-based phylogenetic reconstruction question with respect to a much wider taxonomic sample. We apply K(r), FFP, and an alternative alignment-free method, the average common subsequence (ACS) (Ulitsky et al., 2006), to 24 multicellular eukaryotes (vertebrates, invertebrates, and plants). We also apply ACS to the proteome sequences of these 24 taxa. We compare the resulting trees to a standard reference, the National Center for Biotechnology Information (NCBI) taxonomy tree. Trees produced by ACS(AA), based on proteomes, are in complete agreement with the NCBI tree. For the genome-based reconstruction, ACS(DNA) produces trees whose agreement with the NCBI tree is excellent to very good for divergence times up to 800 million years ago, medium at 1 billion years ago, and poor at 1.6 billion years ago. We conclude that whole genomes do carry a clear phylogenetic signal, yet this signal "saturates" with longer divergence times. Furthermore, from the few existing methods, ACS is best capable of detecting this signal.
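To make the ACS idea concrete, here is a minimal sketch of an alignment-free distance in the spirit of Ulitsky et al. (2006), whose measure is usually described as the average common substring. It is not the authors' implementation. The function names are hypothetical, the substring search is a naive quadratic scan suitable only for toy inputs (genome-scale work would need a suffix-tree or suffix-array index), and the length normalisation shown is one common formulation that is assumed here rather than taken from the abstract.

```python
import math


def longest_match_lengths(a: str, b: str) -> list[int]:
    """For each position i in `a`, return the length of the longest
    substring of `a` starting at i that occurs anywhere in `b`."""
    lengths = []
    for i in range(len(a)):
        k = 0
        # Extend the match one character at a time while it still occurs in b.
        while i + k < len(a) and a[i:i + k + 1] in b:
            k += 1
        lengths.append(k)
    return lengths


def acs_directed(a: str, b: str) -> float:
    """Directed ACS-style distance from `a` to `b` (assumed formulation):
    log|b| / L(a, b) minus a correction term 2 * log|a| / |a|."""
    if not a or not b:
        raise ValueError("sequences must be non-empty")
    avg_len = sum(longest_match_lengths(a, b)) / len(a)
    if avg_len == 0:
        return math.inf  # the two sequences share no characters at all
    return math.log(len(b)) / avg_len - 2 * math.log(len(a)) / len(a)


def acs_distance(a: str, b: str) -> float:
    """Symmetrised distance: the mean of the two directed distances."""
    return (acs_directed(a, b) + acs_directed(b, a)) / 2


if __name__ == "__main__":
    x = "ACGTACGTTGCA"
    y = "ACGTACGATGCA"  # differs from x by a single substitution
    z = "GGGGGGCCCCCC"  # unrelated low-complexity sequence
    print(acs_distance(x, y), acs_distance(x, z))
```

A pairwise matrix of such distances over all taxa could then be passed to a distance-based tree-building method such as neighbor joining. The abstract does not say which tree-building step was used, so that choice is left open here.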
