Laboratory of Theoretical Biophysics, School of Physical Science and Technology, Inner Mongolia University, Hohhot, 010021, China.
School of Economics and Management, Inner Mongolia University of Science and Technology, Baotou, 014010, China.
BMC Genomics. 2024 Sep 12;25(1):855. doi: 10.1186/s12864-024-10786-1.
Understanding the composition rules and evolution mechanisms of genome sequences is a core issue in the post-genomic era, and k-mer spectrum analysis of genome sequences is an effective means of addressing it.
We divided all 8-mers of a genome sequence into 16 XY types according to the number of XY dinucleotides in each 8-mer. Previous work showed that independent unimodal distributions are observed only in the three CG-type 8-mer spectra, whereas the non-CG-type 8-mer spectra show no such universal pattern from prokaryotes to eukaryotes. On this basis, we analyzed how the distributions of non-CG-type 8-mer spectra vary across 889 animal genome sequences. Following the evolutionary order of animals from primitive to more complex, we found that the spectrum distributions gradually shift from unimodal to tri-modal. The relative distance from the average frequency of each non-CG type of 8-mers to the center frequency differs both within a species and among species. We further divided the 8-mers that contain CG dinucleotides into 16 subsets, called XY1_CG1 subsets, in which each 8-mer contains both CG and XY dinucleotides. We found that the separability values of the XY1_CG1 spectra are closely related to the evolution and specificity of animals. Taking into account the constraint of Chargaff's second parity rule, we finally obtained 10 separability values as a feature set that characterizes the evolution state of genome sequences. To verify that the feature set is reasonable, we ran binary classification tests with 14 common classification algorithms. The resulting accuracy (Acc) ranged from 83.88% to 98.70% among birds, other vertebrates and mammals.
We proposed a credible feature set for characterizing the evolution state of genomes and obtained satisfactory results with it in the large-scale classification of animals.
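The grouping steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' code: the function names, the overlapping count of dinucleotides, and the pooling of "two or more" occurrences into one group are assumptions made for illustration, and the separability value itself is not computed because the abstract does not define it. The sketch counts 8-mers, groups them by the number of XY dinucleotides they contain, and shows why pairing each dinucleotide with its reverse complement (the strand symmetry expressed by Chargaff's second parity rule) reduces 16 dinucleotides to 10 classes.

```python
from collections import Counter
from itertools import product

BASES = "ACGT"
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def count_kmers(seq, k=8):
    """Count overlapping k-mers, skipping windows that contain non-ACGT symbols."""
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set(BASES):
            counts[kmer] += 1
    return counts


def dinucleotide_occurrences(kmer, xy):
    """Number of occurrences of dinucleotide xy in kmer (overlaps counted; an assumption)."""
    return sum(1 for i in range(len(kmer) - 1) if kmer[i:i + 2] == xy)


def split_by_xy(counts, xy, cap=2):
    """Group k-mer counts by how many xy dinucleotides each k-mer contains.

    Counts of `cap` or more are pooled into one group (an assumption).
    """
    groups = {n: {} for n in range(cap + 1)}
    for kmer, c in counts.items():
        n = min(dinucleotide_occurrences(kmer, xy), cap)
        groups[n][kmer] = c
    return groups


def reverse_complement(s):
    """Reverse complement of a DNA string over the alphabet ACGT."""
    return s.translate(COMPLEMENT)[::-1]


def strand_symmetric_classes():
    """Merge each of the 16 dinucleotides with its reverse complement."""
    classes = set()
    for x, y in product(BASES, repeat=2):
        xy = x + y
        classes.add(tuple(sorted({xy, reverse_complement(xy)})))
    return sorted(classes)
```

Four dinucleotides (AT, TA, CG, GC) are their own reverse complements and the remaining twelve form six pairs, so `strand_symmetric_classes()` returns 10 classes, which matches the number of features reported in the abstract. For a genome string `seq`, `split_by_xy(count_kmers(seq), "CG")` gives the 8-mer counts grouped by CG content.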