McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins University School of Medicine, Baltimore, MD 21205, USA.
Nucleic Acids Res. 2010 Aug;38(15):e158. doi: 10.1093/nar/gkq532. Epub 2010 Jun 22.
It has been suggested that the mammalian genome is composed mainly of long compositionally homogeneous domains. Such domains are frequently identified using recursive segmentation algorithms based on the Jensen-Shannon divergence. However, a common difficulty with such methods is deciding when to halt the recursive partitioning and what criteria to use in deciding whether a detected boundary between two segments is real. We demonstrate that commonly used halting criteria are intrinsically biased, and propose IsoPlotter, a parameter-free segmentation algorithm that overcomes such biases by using a simple dynamic halting criterion and tests the homogeneity of the inferred domains. IsoPlotter was compared with an alternative segmentation algorithm, D(JS), using two sets of simulated genomic sequences. Our results show that IsoPlotter was able to infer both long and short compositionally homogeneous domains with low GC content dispersion, whereas D(JS) failed to identify short compositionally homogeneous domains and sequences with low compositional dispersion. By segmenting the human genome with IsoPlotter, we found that one-third of the genome is composed of compositionally nonhomogeneous domains and the remainder is a mixture of many short compositionally homogeneous domains and relatively few long ones.
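The core step shared by Jensen-Shannon-based segmentation methods can be sketched as follows: for every candidate boundary, compute the divergence between the GC composition of the left and right segments and keep the boundary that maximizes it. This is only an illustration of that single step, assuming a two-symbol (GC versus AT) alphabet; the names `binary_entropy` and `best_split` are hypothetical, and the sketch does not implement IsoPlotter's dynamic halting criterion or its homogeneity test, which the abstract does not specify.

```python
import math


def binary_entropy(p):
    """Shannon entropy (in bits) of a two-symbol source with P(GC) = p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def best_split(seq):
    """Return (index, D_JS) for the split seq[:index] / seq[index:] that
    maximizes the Jensen-Shannon divergence of GC composition.

    Assumption: only GC vs. AT content is considered; any other
    character is counted as AT.
    """
    gc = [1 if base in "GC" else 0 for base in seq.upper()]
    n = len(gc)
    if n < 2:
        raise ValueError("need at least two bases to place a boundary")

    total_gc = sum(gc)
    h_whole = binary_entropy(total_gc / n)

    best_index, best_djs = None, -1.0
    left_gc = 0
    for i in range(1, n):
        left_gc += gc[i - 1]          # GC count in seq[:i]
        right_gc = total_gc - left_gc  # GC count in seq[i:]
        # D_JS = H(whole) - length-weighted mean of the segment entropies
        djs = (h_whole
               - (i / n) * binary_entropy(left_gc / i)
               - ((n - i) / n) * binary_entropy(right_gc / (n - i)))
        if djs > best_djs:
            best_index, best_djs = i, djs
    return best_index, best_djs


if __name__ == "__main__":
    # Toy sequence: an AT-only block followed by a GC-only block.
    toy = "AT" * 10 + "GC" * 10
    print(best_split(toy))
```

A recursive segmenter would apply `best_split` to each resulting segment in turn; deciding when to stop that recursion, and whether a given boundary is real, is precisely the question the paper addresses.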