Department of Genome Sciences, Howard Hughes Medical Institute, University of Washington School of Medicine, Seattle, Washington 98195, USA.
Genome Res. 2011 Jan;21(1):137-45. doi: 10.1101/gr.111278.110. Epub 2010 Nov 16.
Despite its importance in cell biology and evolution, the centromere has remained the final frontier in genome assembly and annotation due to its complex repeat structure. However, isolation and characterization of the centromeric repeats from newly sequenced species are necessary for a complete understanding of genome evolution and function. In recent years, various genomes have been sequenced, but the characterization of the corresponding centromeric DNA has lagged behind. Here, we present a computational method (RepeatNet) to systematically identify higher-order repeat structures from unassembled whole-genome shotgun sequence and test whether these sequence elements correspond to functional centromeric sequences. We analyzed genome datasets from six species of mammals representing the diversity of the mammalian lineage, namely, horse, dog, elephant, armadillo, opossum, and platypus. We define candidate monomer satellite repeats and demonstrate centromeric localization for five of the six genomes. Our analysis revealed the greatest diversity of centromeric sequences in horse and dog, in contrast to elephant and armadillo, which showed high centromeric sequence homogeneity. We could not isolate centromeric sequences within the platypus genome, suggesting that centromeres in platypus are not enriched in satellite DNA. Our method can be applied to the characterization of thousands of other vertebrate genomes anticipated for sequencing in the near future, providing an important tool for annotation of centromeres.
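The abstract does not describe how RepeatNet works internally, so the following is not the authors' algorithm. It is only a minimal sketch of the general idea that a tandemly repeated satellite monomer leaves a periodic signal inside a single unassembled read: the same k-mer recurs at a fixed spacing equal to the monomer length. The function names, the k-mer size, and the `min_support` threshold are all hypothetical choices made for illustration.

```python
from collections import Counter


def kmer_spacing_histogram(read, k=8):
    """Count distances between consecutive occurrences of the same k-mer.

    Illustrative helper (an assumption, not part of RepeatNet).
    """
    last_seen = {}       # k-mer -> start position of its most recent occurrence
    spacing = Counter()  # distance -> number of times that distance was observed
    for i in range(len(read) - k + 1):
        kmer = read[i:i + k]
        if kmer in last_seen:
            spacing[i - last_seen[kmer]] += 1
        last_seen[kmer] = i
    return spacing


def candidate_monomer_length(read, k=8, min_support=5):
    """Return the most frequent k-mer spacing, or None if support is too weak.

    min_support is an arbitrary illustrative threshold.
    """
    spacing = kmer_spacing_histogram(read, k)
    if not spacing:
        return None
    period, support = spacing.most_common(1)[0]
    return period if support >= min_support else None


# Toy data: a made-up 20 bp monomer repeated six times in one "read".
monomer = "ACGTTGCAAGCTTAGGCTAC"
tandem_read = monomer * 6
print(candidate_monomer_length(tandem_read))
print(candidate_monomer_length(monomer))
```

On real shotgun reads, sequence divergence between monomer copies and sequencing errors would blur this signal, which is one reason an approach of this kind would need alignment-based comparison rather than exact k-mer matching.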