Loire Etienne, Praz Françoise, Higuet Dominique, Netter Pierre, Achaz Guillaume
Université Pierre et Marie Curie-Paris 6, Unité Mixte de recherche 7592, Institut Jacques Monod, Paris, France.
Mol Biol Evol. 2009 Jan;26(1):111-21. doi: 10.1093/molbev/msn230. Epub 2008 Oct 8.
Simple sequence repeats (SSRs) are very common short repeats in eukaryotic genomes. "Long" SSRs are considered "hypermutable" sequences because they exhibit a high rate of expansion and contraction. Because they are potentially deleterious, long SSRs tend to be uncommon in coding sequences. However, several genes contain long SSRs in their exonic sequences. Here, we identify 1,291 human genes that host a mononucleotide SSR long enough to be prone to expansion or contraction; we refer to these genes as hypermutable hereafter. On the basis of Gene Ontology annotations, we show that only a restricted number of functions are overrepresented among these hypermutable genes, including cell cycle and maintenance of DNA integrity. Using a probabilistic model, we show that genes involved in these functions are expected to host long SSRs because they tend to be long and/or biased in nucleotide composition. Finally, we show that for almost all functions we observe fewer hypermutable sequences than expected under a neutral model. There are, however, interesting exceptions: genes involved in protein and RNA transport, as well as in meiosis and mismatch repair, have as many hypermutable genes as expected under neutrality. Conversely, there are functions (e.g., collagen-related genes) in which hypermutable genes are more often avoided than in other functions. Our results show that, even though several functions harbor unusually long SSRs in their exons, long SSRs are deleterious sequences in almost all functions and are removed by purifying selection. The strength of this purifying selection, however, varies greatly from function to function. We discuss possible explanations for this intriguing result.
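The reasoning that longer or compositionally biased genes are expected to host long mononucleotide SSRs by chance can be illustrated with a small sketch. This is not the probabilistic model used in the paper, which the abstract does not specify. It assumes a much simpler null in which bases are drawn independently with fixed frequencies. The function names, the run-length threshold `k`, and the example frequencies are all hypothetical choices made for illustration.

```python
import re


def longest_mono_run(seq):
    """Return (base, length) of the longest mononucleotide run in seq."""
    best_base, best_len = "", 0
    for match in re.finditer(r"A+|C+|G+|T+", seq.upper()):
        run_len = match.end() - match.start()
        if run_len > best_len:
            best_base, best_len = match.group()[0], run_len
    return best_base, best_len


def prob_run_at_least(length, k, freqs):
    """Probability that a random sequence of the given length contains at
    least one mononucleotide run of k or more bases.

    Assumption (not from the paper): bases are independent draws with the
    frequencies in `freqs`, a dict such as {"A": 0.25, ...} summing to 1.
    """
    if k < 1:
        raise ValueError("k must be >= 1")
    if length < k:
        return 0.0
    if k == 1:
        return 1.0
    # alive[b][r]: probability that no run of length >= k has occurred yet
    # and the sequence currently ends in a run of base b with length r.
    alive = {b: [0.0] * k for b in freqs}
    for b, p in freqs.items():
        alive[b][1] = p
    for _ in range(length - 1):
        per_base = {b: sum(alive[b]) for b in freqs}
        total = sum(per_base.values())
        nxt = {b: [0.0] * k for b in freqs}
        for b, p in freqs.items():
            # A different base preceded b: a new run of length 1 starts.
            nxt[b][1] = (total - per_base[b]) * p
            # The same base repeats: the run grows by one.
            for r in range(1, k - 1):
                nxt[b][r + 1] = alive[b][r] * p
            # Mass at r == k - 1 that repeats reaches length k and is
            # dropped from `alive`, i.e. counted as containing a long run.
        alive = nxt
    return 1.0 - sum(sum(v) for v in alive.values())


uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
at_rich = {"A": 0.35, "C": 0.15, "G": 0.15, "T": 0.35}  # hypothetical bias

for name, freqs in (("uniform", uniform), ("AT-rich", at_rich)):
    for length in (1000, 5000):
        p = prob_run_at_least(length, 8, freqs)
        print(f"{name:8s} length={length:5d}  P(run >= 8) = {p:.4f}")
```

Under this simplified null, the probability of hosting a long run rises with both sequence length and compositional bias, which is the qualitative effect the abstract attributes to gene length and nucleotide composition. Comparing such an expectation with the observed count of genes carrying long runs is the kind of contrast the abstract describes, although the actual model and thresholds are those of the paper, not this sketch.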