Department of Physics and Astronomy, University of Bologna, Bologna, Italy.
Center for Complex Network Research and Physics Department, Northeastern University, Boston, MA, USA.
BMC Bioinformatics. 2018 Oct 15;19(Suppl 10):355. doi: 10.1186/s12859-018-2303-2.
Statistical approaches to genetic sequences have proven helpful for gaining deeper insight into biological and structural functionalities, using ideas from information theory and stochastic modelling of symbolic sequences. In particular, previous analyses of CG dinucleotide positions along the genome highlighted the dinucleotide's epigenetic role in DNA methylation, showing a distribution tail that differs from that of other dinucleotides. In this paper we extend the analysis to the whole CG distance distribution over a selected set of higher-order organisms. We then apply the best-fitting probability density function to a large range of organisms (>4400) of different complexity (from bacteria to mammals) and characterize some emerging global features.
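As an illustration of the quantity being analysed, the following is a minimal sketch of how CG inter-distances could be extracted from a sequence. The function name `cg_distances` and the convention of measuring the distance between the start positions of consecutive CG occurrences are assumptions made for this example, not details taken from the paper.

```python
def cg_distances(sequence):
    """Return distances between start positions of consecutive CG dinucleotides.

    Hypothetical helper: the distance convention (start to start) is an
    assumption made for illustration.
    """
    seq = sequence.upper()
    positions = []
    start = seq.find("CG")
    while start != -1:
        positions.append(start)
        # Resume the search one character later to reach the next occurrence.
        start = seq.find("CG", start + 1)
    # Differences between neighbouring positions give the distance sample.
    return [b - a for a, b in zip(positions, positions[1:])]


if __name__ == "__main__":
    print(cg_distances("ACGTTCGCG"))
```

With this convention the smallest possible distance is 2, since two CG occurrences cannot overlap.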
We find that the Gamma distribution is optimal for the selected subset when compared with a group of several distributions, chosen either for their physical meaning or because they were recently used in the literature for similar studies. The parameters of this distribution, when fitted to our larger set of organisms, highlight some biologically relevant features of the considered organism classes, which can also be useful for classification purposes.
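To make the model comparison concrete, here is a rough sketch of fitting a Gamma distribution to a sample of distances and comparing it with one alternative. It rests on several assumptions: the method-of-moments estimator, the exponential distribution as the competing model, and the Akaike information criterion (AIC) as the score are illustrative choices, and the abstract does not state which estimator, alternatives, or criterion the authors used. All function names are hypothetical.

```python
import math


def gamma_moments_fit(distances):
    """Method-of-moments estimate of Gamma shape and scale (assumed estimator)."""
    n = len(distances)
    mean = sum(distances) / n
    var = sum((d - mean) ** 2 for d in distances) / (n - 1)
    shape = mean ** 2 / var
    scale = var / mean
    return shape, scale


def gamma_loglik(distances, shape, scale):
    """Log-likelihood of the sample under Gamma(shape, scale); needs d > 0."""
    return sum(
        (shape - 1) * math.log(d)
        - d / scale
        - math.lgamma(shape)
        - shape * math.log(scale)
        for d in distances
    )


def exponential_loglik(distances):
    """Log-likelihood under an exponential with the sample mean (its MLE)."""
    mean = sum(distances) / len(distances)
    return sum(-math.log(mean) - d / mean for d in distances)


def aic(loglik, n_params):
    """Akaike information criterion; lower values indicate a better trade-off."""
    return 2 * n_params - 2 * loglik
```

A lower `aic(gamma_loglik(d, shape, scale), 2)` than `aic(exponential_loglik(d), 1)` would favour the Gamma model on that sample. Real CG distances are integers, so a continuous density is an approximation here.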
The quantification of statistical properties of CG dinucleotide positioning along the genome is confirmed as a useful tool to characterize broad classes of organisms, spanning the whole range of biological complexity.