Thomas Elizabeth E, Srebro Nathan, Sebat Jonathan, Navin Nicholas, Healy John, Mishra Bud, Wigler Michael
Cold Spring Harbor Laboratory, 1 Bungtown Road, Cold Spring Harbor, NY 11724, USA.
Proc Natl Acad Sci U S A. 2004 Jul 13;101(28):10349-54. doi: 10.1073/pnas.0403727101. Epub 2004 Jul 6.
Mammalian genomes are densely populated with long duplicated sequences. In this paper, we demonstrate the existence of doublets, short duplications between 25 and 100 bp, distinct from previously described repeats. Each doublet is a pair of exact matches, separated by some distance. The distribution of these intermatch distances is strikingly nonrandom. An unexpectedly high number of doublets have matches either within 100 bp (adjacent) or at distances tightly concentrated approximately 1,000 bp apart (nearby). We focus our study on these proximate doublets. First, they tend to have both matches on the same strand. By comparing nearby doublets shared in human and chimpanzee, we can also see that these doublets seem to arise by an insertion event that produces a copy without markedly affecting the surrounding sequence. Most doublets in humans are shared with chimpanzee, but many new pairs arose after the divergence of the species. Doublets found in human but not chimpanzee are most often composed of almost tandem matches, whereas older doublets (found in both species) are more likely to have matches spaced by approximately 1 kb, indicating that the nearly tandem doublets may be more dynamic. The spacing of doublets is highly conserved. So far, we have found clearly recognizable doublets in the following genomes: Homo sapiens, Mus musculus, Arabidopsis thaliana, and Caenorhabditis elegans, indicating that the mechanism generating these doublets is widespread. A mechanism that generates short local duplications while conserving polarity could have a profound impact on the evolution of regulatory and protein-coding sequences.