Oliver José L, Carpena Pedro, Hackenberg Michael, Bernaola-Galván Pedro
Departamento de Genética, Instituto de Biotecnología, Facultad de Ciencias, Universidad de Granada, Spain.
Nucleic Acids Res. 2004 Jul 1;32(Web Server issue):W287-92. doi: 10.1093/nar/gkh399.
Isochores are long genome segments homogeneous in G+C. Here, we describe an algorithm (IsoFinder) running on the web (http://bioinfo2.ugr.es/IsoF/isofinder.html) able to predict isochores at the sequence level. We move a sliding pointer from left to right along the DNA sequence. At each position of the pointer, we compute the mean G+C values to the left and to the right of the pointer. We then determine the position of the pointer for which the difference between left and right mean values (as measured by the t-statistic) reaches its maximum. Next, we determine the statistical significance of this potential cutting point, after filtering out short-scale heterogeneities below 3 kb by applying a coarse-graining technique. Finally, the program checks whether this significance exceeds a probability threshold. If so, the sequence is cut at this point into two subsequences; otherwise, the sequence remains undivided. The procedure continues recursively for each of the two resulting subsequences created by each cut. This leads to the decomposition of a chromosome sequence into long homogeneous genome regions (LHGRs) with well-defined mean G+C contents, each significantly different from the G+C contents of the adjacent LHGRs. Most LHGRs can be identified with Bernardi's isochores, given their correlation with biological features such as gene density, SINE and LINE (short, long interspersed repetitive elements) densities, recombination rate or single nucleotide polymorphism variability. The resulting isochore maps are available at our web site (http://bioinfo2.ugr.es/isochores/), and also at the UCSC Genome Browser (http://genome.cse.ucsc.edu/).
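The recursive procedure described above can be illustrated with a minimal Python sketch. It is not the IsoFinder implementation, and it makes several simplifying assumptions. The sequence is coarse-grained into non-overlapping windows first (3 kb by default), and both the cut search and the test are run on the window means. A fixed cutoff on the maximum t-statistic (the hypothetical parameter `t_threshold`) stands in for the probability threshold, whose exact computation is not given here. All function and parameter names are illustrative.

```python
from math import sqrt


def gc_indicator(seq):
    """Map each base to 1.0 (G or C) or 0.0 (anything else)."""
    return [1.0 if base in "GCgc" else 0.0 for base in seq]


def coarse_grain(values, window):
    """Average non-overlapping windows; a trailing partial window is dropped."""
    return [sum(values[i:i + window]) / window
            for i in range(0, len(values) - window + 1, window)]


def best_cut(values, min_len=2):
    """Return (index, t) of the cut with the largest pooled-variance t-statistic.

    Returns (None, 0.0) when no cut separates the two sides at all.
    """
    n = len(values)
    total = sum(values)
    total_sq = sum(v * v for v in values)
    best_i, best_t = None, 0.0
    left = left_sq = 0.0
    for i in range(1, n):
        left += values[i - 1]
        left_sq += values[i - 1] ** 2
        n1, n2 = i, n - i
        if n1 < min_len or n2 < min_len:
            continue
        m1, m2 = left / n1, (total - left) / n2
        ss1 = left_sq - n1 * m1 * m1
        ss2 = (total_sq - left_sq) - n2 * m2 * m2
        # Floor the pooled variance so constant segments do not divide by zero.
        pooled = max((ss1 + ss2) / (n - 2), 1e-12)
        t = abs(m1 - m2) / sqrt(pooled * (1.0 / n1 + 1.0 / n2))
        if t > best_t:
            best_i, best_t = i, t
    return best_i, best_t


def segment(values, start=0, t_threshold=5.0, min_len=2):
    """Recursively split `values`; return (start, end) pairs in window units."""
    cut, t = best_cut(values, min_len)
    if cut is None or t < t_threshold:  # assumed stand-in for the significance test
        return [(start, start + len(values))]
    return (segment(values[:cut], start, t_threshold, min_len)
            + segment(values[cut:], start + cut, t_threshold, min_len))


def find_lhgrs(seq, window=3000, t_threshold=5.0):
    """Return (start_bp, end_bp, mean_gc) for each homogeneous region found."""
    if not seq:
        return []
    gc = gc_indicator(seq)
    pieces = segment(coarse_grain(gc, window), 0, t_threshold)
    regions = []
    for k, (a, b) in enumerate(pieces):
        start = a * window
        # Fold any trailing partial window into the last region.
        end = len(seq) if k == len(pieces) - 1 else b * window
        regions.append((start, end, sum(gc[start:end]) / (end - start)))
    return regions
```

Calling `find_lhgrs(sequence)` on a chromosome-scale string returns base-pair intervals with their mean G+C content. Because the maximum of t over many candidate positions does not follow the ordinary Student's t distribution, a fixed cutoff such as the one assumed here is only a rough proxy for a proper significance level.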