Haiminen Niina, Mannila Heikki
HIIT Basic Research Unit, Department of Computer Science, University of Helsinki, Finland.
Gene. 2007 Jun 1;394(1-2):53-60. doi: 10.1016/j.gene.2007.01.028. Epub 2007 Feb 16.
The isochore structure of a genome is observable by variation in the G+C (guanine and cytosine) content within and between the chromosomes. Describing the isochore structure of vertebrate genomes is a challenging task, and many computational methods have been developed and applied to it. Here we apply a well-known least-squares optimal segmentation algorithm to isochore discovery. The algorithm finds the best division of the sequence into k pieces, such that the segments are internally as homogeneous as possible. We show how this simple segmentation method can be applied to isochore discovery using as input the G+C content of sliding windows on the sequence. To evaluate the performance of this segmentation technique on isochore detection, we present results from segmenting previously studied isochore regions of the human genome. Detailed results on the MHC locus, on parts of chromosomes 21 and 22, and on a 100 Mb region from chromosome 1 are similar to previously suggested isochore structures. We also give results on segmenting all 22 autosomal human chromosomes. An advantage of this technique is that oversegmentation of G+C rich regions can generally be avoided. This is because the technique concentrates on greater global, instead of smaller local, differences in the sequence composition. The effect is further emphasized by a log-transformation of the data that lowers the high variance that is observed in G+C rich regions. We conclude that the least-squares optimal segmentation method is computationally efficient and yields results close to previous biologically motivated isochore structures.
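The segmentation step described in the abstract can be illustrated with a minimal sketch. It is a textbook dynamic-programming formulation of least-squares k-segmentation, not the authors' implementation. The function names `gc_windows` and `segment_least_squares`, the use of non-overlapping windows, and the toy input values are assumptions made for illustration. The log-transformation is left out because the abstract does not specify its exact form.

```python
def gc_windows(seq, window):
    """G+C fraction of consecutive, non-overlapping windows.

    Assumption: windows do not overlap and a trailing partial window is dropped.
    """
    seq = seq.upper()
    fractions = []
    for start in range(0, len(seq) - window + 1, window):
        chunk = seq[start:start + window]
        fractions.append(sum(base in "GC" for base in chunk) / window)
    return fractions


def segment_least_squares(values, k):
    """Split `values` into k contiguous segments minimising the total squared
    deviation of each value from its segment mean.

    Returns (total_cost, segments), where segments is a list of half-open
    index pairs (start, end).
    """
    n = len(values)
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= len(values)")

    # Prefix sums of values and squared values give any segment cost in O(1).
    prefix = [0.0]
    prefix_sq = [0.0]
    for v in values:
        prefix.append(prefix[-1] + v)
        prefix_sq.append(prefix_sq[-1] + v * v)

    def sse(i, j):
        """Sum of squared errors of values[i:j] around its own mean."""
        total = prefix[j] - prefix[i]
        return (prefix_sq[j] - prefix_sq[i]) - total * total / (j - i)

    inf = float("inf")
    # cost[m][j]: best cost of splitting values[:j] into m segments.
    cost = [[inf] * (n + 1) for _ in range(k + 1)]
    back = [[0] * (n + 1) for _ in range(k + 1)]
    cost[0][0] = 0.0
    for m in range(1, k + 1):
        for j in range(m, n + 1):
            for i in range(m - 1, j):
                candidate = cost[m - 1][i] + sse(i, j)
                if candidate < cost[m][j]:
                    cost[m][j] = candidate
                    back[m][j] = i

    # Walk the back-pointers from the end to recover the segment boundaries.
    segments = []
    j = n
    for m in range(k, 0, -1):
        i = back[m][j]
        segments.append((i, j))
        j = i
    segments.reverse()
    return cost[k][n], segments


# Toy window G+C fractions (made up for illustration, not data from the paper).
toy = [0.40, 0.41, 0.39, 0.55, 0.56, 0.54, 0.45, 0.44]
total_cost, segments = segment_least_squares(toy, 3)
print(segments)  # [(0, 3), (3, 6), (6, 8)]
```

This formulation takes O(k n^2) time for n windows, which is one reason to segment window averages rather than single bases. Because the cost is summed over the whole sequence, a boundary is placed only where it lowers the global error, which matches the abstract's point that large global differences are favoured over small local ones.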