Li W
Laboratory of Statistical Genetics, The Rockefeller University, 1230 York Avenue, Box 192, New York, NY 10021, USA.
Gene. 2001 Oct 3;276(1-2):57-72. doi: 10.1016/s0378-1119(01)00672-2.
The concept of homogeneity of G+C content is always relative and subjective. This point is emphasized and quantified in this paper using a simple example of one sequence segmented into two subsequences. Whether the sequence is homogeneous can be decided by asking whether the two-subsequence model describes the DNA sequence better than the one-sequence model. There are at least three equivalent ways of looking at the 1-to-2 segmentation: the Jensen-Shannon divergence measure, the log-likelihood ratio test, and model selection using the Bayesian information criterion. Once a criterion is chosen, a DNA sequence can be recursively segmented into multiple domains. We use one subjective criterion, called segmentation strength, based on the Bayesian information criterion. Whether a sequence is homogeneous, and how many domains it has, depends on this criterion. We compare six different genome sequences (yeast S. cerevisiae chromosomes III and IV, the bacterium M. pneumoniae, the human major histocompatibility complex sequence, and the longest contigs in human chromosomes 21 and 22) by recursive segmentation at different strength criteria. The results confirm that yeast chromosome IV is more homogeneous than yeast chromosome III, that human chromosome 21 is more homogeneous than human chromosome 22, and that bacterial genomes may not be homogeneous because of short segments with distinct base compositions. Recursive segmentation also provides a quantitative criterion for identifying isochores in human sequences. Some features of our recursive segmentation, such as the possibility of delineating domain borders accurately, are superior to those of the moving-window approach commonly used in such analyses.
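The 1-to-2 segmentation and its recursive use can be illustrated with a minimal Python sketch. It relies on the standard identity that, with maximum-likelihood base frequencies and natural logarithms, twice the log-likelihood ratio of the two-subsequence model over the one-sequence model equals 2N times the Jensen-Shannon divergence. Everything else is an assumption made for illustration and is not taken from the abstract: the function names, the two-symbol (G+C versus A+T) alphabet, the count of two extra parameters for the split model (one extra composition plus the border position), and the form of the strength as the relative excess of 2N·D_JS over the BIC penalty.

```python
import math


def gc_entropy(gc, n):
    """Shannon entropy (in nats) of a G+C versus A+T composition."""
    h = 0.0
    for k in (gc, n - gc):
        if k > 0:
            p = k / n
            h -= p * math.log(p)
    return h


def best_split(seq):
    """Return (position, 2*N*D_JS) for the split with the largest
    Jensen-Shannon divergence; position is None if no split helps."""
    is_gc = [1 if base in "GCgc" else 0 for base in seq]
    n = len(is_gc)
    total = sum(is_gc)
    h_all = gc_entropy(total, n)
    best_pos, best_llr = None, 0.0
    left = 0  # running G+C count of the left subsequence
    for i in range(1, n):
        left += is_gc[i - 1]
        h_left = gc_entropy(left, i)
        h_right = gc_entropy(total - left, n - i)
        d_js = h_all - (i / n) * h_left - ((n - i) / n) * h_right
        llr = 2 * n * d_js  # equals 2 * log(likelihood ratio)
        if llr > best_llr:
            best_pos, best_llr = i, llr
    return best_pos, best_llr


def segmentation_strength(llr, n, extra_params=2):
    """Relative excess of 2*N*D_JS over the BIC penalty.
    ASSUMPTION: extra_params=2 and this exact form are illustrative."""
    penalty = extra_params * math.log(n)
    return (llr - penalty) / penalty


def segment(seq, s0=0.0, offset=0):
    """Recursively split while the strength exceeds the threshold s0;
    return the sorted list of domain borders."""
    n = len(seq)
    if n < 4:
        return []
    pos, llr = best_split(seq)
    if pos is None or segmentation_strength(llr, n) <= s0:
        return []
    return (segment(seq[:pos], s0, offset)
            + [offset + pos]
            + segment(seq[pos:], s0, offset + pos))


if __name__ == "__main__":
    # Toy input: an A+T-only block followed by a G+C-only block.
    print(segment("AT" * 100 + "GC" * 100))
```

Raising the hypothetical threshold `s0` makes the procedure accept fewer borders, so the same sequence can appear homogeneous at a strict setting and heterogeneous at a lenient one, which is the sense in which the abstract calls homogeneity relative. The sketch slices strings recursively for clarity; a genome-scale run would more plausibly work on index ranges over precomputed cumulative counts.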