Sindi Suzanne S, Hunt Brian R, Yorke James A
Institute for Physical Sciences and Technology, University of Maryland, College Park, Maryland 20742, USA.
Phys Rev E Stat Nonlin Soft Matter Phys. 2008 Dec;78(6 Pt 1):061912. doi: 10.1103/PhysRevE.78.061912. Epub 2008 Dec 11.
We study quantitative features of complex repetitive DNA in several genomes by examining sequences long enough that they are unlikely to have repeated by chance. For each genome, we determine the number of identical copies, the "duplication count," of each sequence of length 40, that is, of each "40-mer." We say a 40-mer is "repeated" if its duplication count is at least 2. We focus mainly on "complex" 40-mers, those without short internal repetitions. We find that most of the complex repeated 40-mers fall into two categories: in one, the copies are clustered closely together on a single chromosome; in the other, the copies are distributed widely across multiple chromosomes. For each genome and each of these categories, we compute N(c), the number of 40-mers that have duplication count c, for each integer c. In each case, we observe a power-law-like decay in N(c) as c increases from 3 to 50 or higher. In particular, N(c) decays much more slowly than would be predicted by evolutionary models in which each 40-mer is equally likely to be duplicated. We also analyze an evolutionary model that does reflect the slow decay of N(c).
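The two quantities defined in the abstract, the duplication count of each 40-mer and the distribution N(c), can be illustrated with a minimal Python sketch. This is not the authors' pipeline. The function names `duplication_counts` and `count_distribution` are hypothetical, only exact forward-strand matches are counted, windows containing the ambiguity code `N` are skipped, and the toy input uses k = 4 rather than 40 so the result is easy to check by hand. The filter for "complex" 40-mers and the clustered versus dispersed classification are not attempted here, because the abstract does not specify them.

```python
from collections import Counter


def duplication_counts(chromosomes, k=40):
    """Map each k-mer to its number of exact copies across all sequences.

    Assumptions (not from the abstract): only the forward strand is
    scanned, and any window containing the ambiguity code 'N' is skipped.
    """
    counts = Counter()
    for seq in chromosomes:
        seq = seq.upper()
        # Slide a window of length k over every start position.
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if "N" not in kmer:
                counts[kmer] += 1
    return counts


def count_distribution(counts):
    """Return N, where N[c] is the number of distinct k-mers whose
    duplication count equals c."""
    return Counter(counts.values())


# Toy example: two short "chromosomes" and k = 4 instead of 40.
toy = ["ACGTACGTAC", "TTACGTAA"]
dup = duplication_counts(toy, k=4)
N = count_distribution(dup)

# A k-mer is "repeated" if its duplication count is at least 2.
repeated = {kmer: c for kmer, c in dup.items() if c >= 2}

for c in sorted(N):
    print(c, N[c])
```

In the toy input, ACGT occurs twice in the first sequence and once in the second, so its duplication count is 3. For a real genome an in-memory `Counter` over all 40-mers would likely be too large, and a dedicated k-mer counting tool or a sorted suffix structure would be a more practical choice.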