Fundación Instituto Leloir, Buenos Aires, Argentina.
Instituto de Investigaciones Bioquímicas de Buenos Aires, Consejo Nacional de Investigaciones Científicas y Técnicas (CONICET), Buenos Aires, Argentina.
PLoS Comput Biol. 2023 Oct 13;19(10):e1011540. doi: 10.1371/journal.pcbi.1011540. eCollection 2023 Oct.
In eukaryotic organisms, the ensemble of 5' splice site sequences reflects a balance between natural nucleotide variability and the minimal molecular constraints needed to ensure splicing fidelity. This compromise shapes the statistical patterns underlying the composition of donor splice site sequences. This study aimed to mine conserved and divergent signals in the composition of 5' splice site sequences. Because 5' donor sequences are a major cue for proper recognition of splice sites, we reasoned that statistical regularities in their composition could reflect the biological functionality and evolutionary history associated with splicing mechanisms. Results: We used a regularized maximum entropy modeling framework to mine for non-trivial two-site correlations in donor sequence datasets from 30 different eukaryotes. For each species analyzed, we identified minimal sets of two-site coupling patterns that reproduced, at a given regularization level, the one-site and two-site frequencies observed in donor sequences. Through a systematic, comparative analysis of 5' splice sites, we showed that lineage information can be traced from joint dinucleotide probabilities. We identified characteristic two-site coupling patterns for plants and animals, and propose that they may echo differences in splicing regulation previously reported between these groups.
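The one-site and two-site frequencies mentioned above are the empirical statistics that a pairwise maximum entropy model is fitted to reproduce. The following is a minimal sketch of how they might be computed, not the authors' pipeline: the function names `site_statistics` and `connected_correlation` are hypothetical, the four 9-nucleotide donor sequences are made-up toy data, and the regularized model fitting itself is not shown.

```python
from collections import Counter
from itertools import combinations

def site_statistics(seqs):
    """Return one-site and two-site frequencies for equal-length sequences.

    f1[i][a]           : fraction of sequences with nucleotide a at position i
    f2[(i, j)][(a, b)] : fraction with a at position i and b at position j (i < j)
    """
    n = len(seqs)
    length = len(seqs[0])
    f1 = [Counter() for _ in range(length)]
    f2 = {pair: Counter() for pair in combinations(range(length), 2)}
    for seq in seqs:
        for i, nt in enumerate(seq):
            f1[i][nt] += 1 / n
        for i, j in f2:
            f2[(i, j)][(seq[i], seq[j])] += 1 / n
    return f1, f2

def connected_correlation(f1, f2, i, j, a, b):
    """Joint frequency minus the product of marginals: f_ij(a,b) - f_i(a) f_j(b)."""
    return f2[(i, j)][(a, b)] - f1[i][a] * f1[j][b]

# Toy data (assumption): invented sequences, not taken from any real dataset.
toy_donors = ["CAGGTAAGT", "AAGGTGAGT", "CAGGTAAGA", "TAGGTATGT"]
f1, f2 = site_statistics(toy_donors)

# Positions 3 and 4 are invariant (G and T) in the toy set, so they carry no correlation.
print(connected_correlation(f1, f2, 3, 4, "G", "T"))
# Positions 5 and 6 vary, so their connected correlation can be non-zero.
print(connected_correlation(f1, f2, 5, 6, "A", "A"))
```

A connected correlation close to zero means the two-site frequency is already explained by the one-site frequencies. Non-zero values are the kind of signal a pairwise coupling term would need to account for.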