Laboratory of Artificial and Natural Evolution (LANE), Department of Zoology and Animal Biology, Sciences III, 30, Quai Ernest-Ansermet, 1211 Geneva 4, Switzerland.
Genome Biol. 2010;11(2):R16. doi: 10.1186/gb-2010-11-2-r16. Epub 2010 Feb 9.
Given the availability of full genome sequences, mapping gene gains, duplications, and losses during evolution should theoretically be straightforward. However, this endeavor suffers from an overemphasis on detecting conserved genome features, which in turn has led to the sequencing of multiple eutherian genomes at low coverage rather than fewer genomes at high coverage with a more even distribution across the phylogeny. Although the limitations associated with the analysis of low-coverage genomes are recognized, they have not been quantified.
Here, using recently developed comparative genomic application systems, we evaluate the impact of low-coverage genomes on inferences pertaining to gene gains and losses when analyzing eukaryote genome evolution through gene duplication. We demonstrate that, when performing inference of genome content evolution, low-coverage genomes generate not only a massive number of false gene losses, but also striking artifacts in gene duplication inference, especially at the most recent common ancestor of low-coverage genomes. We show that the artifactual gains are caused by the low coverage of genome sequence per se rather than by the increased taxon sampling in a biased portion of the species tree.
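As a rough illustration of why low coverage alone can produce spurious gene losses, the sketch below uses the idealized Lander-Waterman (Poisson) model, under which the expected fraction of a genome left unsequenced at coverage c is e^(-c). This is not the method used in the study: the function names, the 2x coverage value, and the gene count of 20,000 are assumptions chosen only for illustration, and treating each gene as a single point that is either sequenced or missed is a deliberate simplification.

```python
import math


def uncovered_fraction(coverage: float) -> float:
    """Expected fraction of a genome with no read coverage.

    Assumes the idealized Lander-Waterman model (reads placed uniformly
    at random), which gives exp(-coverage). Real assemblies deviate
    from this.
    """
    if coverage < 0:
        raise ValueError("coverage must be non-negative")
    return math.exp(-coverage)


def expected_missing_genes(n_genes: int, coverage: float) -> float:
    """Expected number of genes absent from the sequence data.

    Simplification (assumption): each gene is treated as a single
    point that is missed with probability exp(-coverage). A naive
    presence/absence analysis would score each such gene as a loss.
    """
    return n_genes * uncovered_fraction(coverage)


if __name__ == "__main__":
    # Hypothetical values: 20,000 genes, compared at 2x and 7x coverage.
    for cov in (2.0, 7.0):
        missing = expected_missing_genes(20_000, cov)
        print(f"{cov:.0f}x coverage: ~{missing:.0f} genes expected to be missing")
```

Under these assumptions, a genome sequenced at 2x would lack roughly 13.5% of its genes in the data, so a comparison against high-coverage genomes would record a large number of apparent losses that reflect sampling rather than evolution. The sketch does not address the artifactual duplications described above, which depend on how incomplete gene sequences are placed in gene trees.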
We argue that it will remain difficult to differentiate artifacts from true changes in modes and tempo of genome evolution until there is better homogeneity in both taxon sampling and high-coverage sequencing. This is important for broadening the utility of full genome data to the community of evolutionary biologists, whose interests go well beyond widely conserved physiologies and developmental patterns as they seek to understand the generative mechanisms underlying biological diversity.