Oliver J L, Bernaola-Galván P, Carpena P, Román-Roldán R
Departamento de Genética, Instituto de Biotecnología, Universidad de Granada, E-18071, Granada, Spain.
Gene. 2001 Oct 3;276(1-2):47-56. doi: 10.1016/s0378-1119(01)00641-2.
Analytical DNA ultracentrifugation revealed that eukaryotic genomes are mosaics of isochores: long DNA segments (>>300 kb on average) that are relatively homogeneous in G+C content. Important genome features depend on this isochore structure; for example, genes are found predominantly in the GC-richest isochore classes. However, no reliable method is available to rigorously partition a genome sequence into relatively homogeneous regions of different composition, and thereby reveal the isochore structure of chromosomes at the sequence level. Homogeneous regions are currently ascertained by plain statistics on moving windows of arbitrary length, or simply by eye on G+C plots. By contrast, the entropic segmentation method is able to divide a DNA sequence into relatively homogeneous, statistically significant domains. An early version of this algorithm only produced domains with an average length far below the typical isochore size. Here we show that an improved segmentation method, specifically intended to determine the most statistically significant partition of the sequence at each scale, is able to identify the boundaries between long homogeneous genome regions displaying the typical features of isochores. The algorithm precisely locates classes II and III of the human major histocompatibility complex region, two isochores that are well characterized at the sequence level; the boundary between them was the first isochore boundary to be experimentally characterized at the sequence level. The analysis is then extended to a collection of large human contigs. The relatively homogeneous regions we find show many of the features (G+C range, relative proportion of isochore classes, size distribution, and relationship with gene density) of the isochores identified through DNA centrifugation. Isochore chromosome maps, with many potential applications in genomics, are then drawn for all the completely sequenced eukaryotic genomes available.
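The abstract does not spell out the segmentation procedure, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes that compositional difference is measured by the Jensen-Shannon divergence over a two-symbol alphabet (G+C versus A+T), and it replaces the statistical significance test with a fixed, hypothetical cutoff `min_js`. The function names `shannon_entropy`, `best_cut`, and `segment` are illustrative.

```python
import math


def shannon_entropy(gc, n):
    """Entropy in bits of a two-symbol (G+C vs A+T) composition."""
    if n == 0:
        return 0.0
    h = 0.0
    for count in (gc, n - gc):
        if count:
            p = count / n
            h -= p * math.log2(p)
    return h


def best_cut(seq):
    """Return (position, divergence) of the cut that maximises the
    Jensen-Shannon divergence between the left and right parts."""
    n = len(seq)
    is_gc = [1 if base in "GCgc" else 0 for base in seq]
    total_gc = sum(is_gc)
    h_whole = shannon_entropy(total_gc, n)
    best_pos, best_js = 0, 0.0
    left_gc = 0
    for i in range(1, n):
        left_gc += is_gc[i - 1]
        right_gc = total_gc - left_gc
        # Length-weighted entropies of the two candidate subsequences.
        js = (h_whole
              - (i / n) * shannon_entropy(left_gc, i)
              - ((n - i) / n) * shannon_entropy(right_gc, n - i))
        if js > best_js:
            best_pos, best_js = i, js
    return best_pos, best_js


def segment(seq, min_js=0.1, offset=0):
    """Recursively split seq; return a sorted list of boundary positions.

    min_js is a hypothetical fixed cutoff standing in for a proper
    significance test, which a real analysis would need.
    """
    pos, js = best_cut(seq)
    if js <= min_js:
        return []
    return (segment(seq[:pos], min_js, offset)
            + [offset + pos]
            + segment(seq[pos:], min_js, offset + pos))


toy = "AT" * 50 + "GC" * 50   # 100 AT-rich bases followed by 100 GC-rich bases
print(segment(toy))           # → [100]
```

On real sequences a fixed cutoff is a poor choice, because short segments produce large divergence values by chance alone; this is why the significance of each candidate cut matters, and why a sketch like this cannot reproduce the isochore maps described above.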