Bernaola-Galván Pedro, Carpena Pedro, Gómez-Martín Cristina, Oliver Jose L
Department of Applied Physics II and Institute Carlos I for Theoretical and Computational Physics, University of Málaga, 29071 Málaga, Spain.
Department of Pathology, Cancer Center Amsterdam, Amsterdam UMC, Vrije Universiteit Amsterdam, 1081 HV Amsterdam, The Netherlands.
Biology (Basel). 2023 Jun 13;12(6):849. doi: 10.3390/biology12060849.
As the genome carries the historical information of a species' biotic and environmental interactions, analyzing changes in genome structure over time with powerful statistical physics methods (such as entropic segmentation algorithms, fluctuation analysis in DNA walks, or measures of compositional complexity) provides valuable insights into genome evolution. Nucleotide frequencies tend to vary along the DNA chain, resulting in a hierarchically patchy chromosome structure with heterogeneities at length scales ranging from a few nucleotides to tens of millions of nucleotides. Fluctuation analysis reveals that these compositional structures can be classified into three main categories: (1) short-range heterogeneities (below a few kilobase pairs (Kbp)), primarily attributed to the alternation of coding and noncoding regions, the density of interspersed or tandem repeats, etc.; (2) isochores, spanning tens to hundreds of Kbp; and (3) superstructures, reaching sizes of tens of megabase pairs (Mbp) or even larger. The isochore and superstructure coordinates obtained for the first complete telomere-to-telomere (T2T) human sequence are now shared in a public database, so interested researchers can use the T2T isochore data, together with the annotations for different genome elements, to test specific hypotheses about genome structure. As at other levels of biological organization, a hierarchical compositional structure is prevalent in the genome. Once the compositional structure of a genome is identified, various measures can be derived to quantify its heterogeneity. The distribution of segment G+C content has recently been proposed as a new genome signature that proves useful for comparing complete genomes. Another meaningful measure is the sequence compositional complexity (SCC), which has been used for genome structure comparisons.
Lastly, we review the recent genome comparisons in species of the ancient phylum Cyanobacteria, conducted by phylogenetic regression of SCC against time, which have revealed positive trends towards higher genome complexity. These findings provide the first evidence for a driven progressive evolution of genome compositional structure.
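The abstract does not state how SCC or the segment G+C signature is computed. As a rough illustration only, the Python sketch below assumes one common entropic formulation: the Shannon entropy of the whole sequence minus the length-weighted entropies of its compositional segments. The function names, the toy sequence, and the hand-picked cut position are all hypothetical. In practice the segment boundaries would come from a segmentation algorithm, which is not implemented here.

```python
import math
from collections import Counter


def shannon_entropy(seq):
    """Shannon entropy (in bits) of the nucleotide frequencies in seq."""
    counts = Counter(seq)
    n = len(seq)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def gc_content(seq):
    """Fraction of G and C nucleotides in seq."""
    return sum(1 for base in seq if base in "GC") / len(seq)


def compositional_complexity(seq, cuts):
    """Assumed SCC-like measure: H(whole) - sum_i (len_i / L) * H(segment_i).

    `cuts` are hypothetical boundary positions lying strictly inside the
    sequence. Returns the complexity value and the per-segment G+C content.
    """
    bounds = [0, *sorted(cuts), len(seq)]
    segments = [seq[a:b] for a, b in zip(bounds, bounds[1:])]
    weighted = sum(len(s) / len(seq) * shannon_entropy(s) for s in segments)
    return shannon_entropy(seq) - weighted, [gc_content(s) for s in segments]


# Toy sequence: an AT-rich domain followed by a GC-rich domain.
toy = "AT" * 50 + "GC" * 50
scc, segment_gc = compositional_complexity(toy, cuts=[100])
print(scc, segment_gc)
```

Under this assumed definition, a sequence whose segments all share the same composition scores zero. The score grows as the segments diverge in composition, which is the sense in which such a measure quantifies heterogeneity.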