Department of Biology & Biochemistry, University of Houston, TX, USA.
Mol Biol Evol. 2010 May;27(5):1015-24. doi: 10.1093/molbev/msp307. Epub 2009 Dec 16.
Numerous segmentation methods for the detection of compositionally homogeneous domains within genomic sequences have been proposed. Unfortunately, these methods yield inconsistent results. Here, we present a benchmark consisting of two sets of simulated genomic sequences for testing the performance of segmentation algorithms. Sequences in the first set are composed of fixed-size homogeneous domains and differ from one another in the between-domain variability of their guanine and cytosine (GC) content. Sequences in the second set are composed of a mosaic of many short domains and a few long ones, distinguished by sharp GC content boundaries between neighboring domains. We use these sets to test the performance of seven segmentation algorithms from the literature. Our results show that recursive segmentation algorithms based on the Jensen-Shannon divergence outperform all other algorithms. However, even these algorithms perform poorly in certain instances because of the arbitrary choice of a segmentation-stopping criterion.
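To make the idea of recursive segmentation based on the Jensen-Shannon divergence concrete, the following is a minimal Python sketch, not the implementation of any of the seven benchmarked algorithms. It assumes a two-symbol alphabet (GC versus AT) and uppercase input, and it uses a fixed divergence threshold as the stopping criterion. The function names `entropy`, `best_split`, and `segment` and the `threshold` parameter are hypothetical and chosen only for illustration.

```python
import math


def entropy(gc, n):
    """Shannon entropy (in bits) of a GC-vs-AT composition with gc GC bases out of n."""
    h = 0.0
    for count in (gc, n - gc):
        if count > 0:
            p = count / n
            h -= p * math.log2(p)
    return h


def best_split(seq):
    """Return (position, divergence) for the split that maximizes the
    length-weighted Jensen-Shannon divergence between the two halves.
    Returns (None, 0.0) if no split yields a positive divergence."""
    n = len(seq)
    is_gc = [1 if base in "GC" else 0 for base in seq]  # assumes uppercase bases
    total_gc = sum(is_gc)
    h_whole = entropy(total_gc, n) if n else 0.0
    best_pos, best_d = None, 0.0
    left_gc = 0
    for i in range(1, n):
        left_gc += is_gc[i - 1]
        right_gc = total_gc - left_gc
        # JS divergence: entropy of the whole minus weighted entropies of the parts.
        d = (h_whole
             - (i / n) * entropy(left_gc, i)
             - ((n - i) / n) * entropy(right_gc, n - i))
        if d > best_d:
            best_pos, best_d = i, d
    return best_pos, best_d


def segment(seq, threshold, offset=0):
    """Recursively split seq and return the sorted list of boundary positions.
    Recursion stops when the best divergence falls below `threshold`
    (an assumed, arbitrary stopping criterion)."""
    pos, d = best_split(seq)
    if pos is None or d < threshold:
        return []
    return (segment(seq[:pos], threshold, offset)
            + [offset + pos]
            + segment(seq[pos:], threshold, offset + pos))


# Toy sequence: an AT-only domain followed by a GC-only domain.
toy = "AT" * 50 + "GC" * 50
boundaries = segment(toy, threshold=0.5)
```

The `threshold` argument plays the role of the segmentation-stopping criterion mentioned in the abstract: a lower value produces more and shorter domains, a higher value fewer and longer ones, which is one way the arbitrary choice of that criterion can affect results.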