Koslicki David, Thompson Daniel J
Department of Mathematics, Oregon State University, 368 Kidder Hall, Corvallis, OR 97330, USA
J Math Biol. 2015 Jan;70(1-2):45-69. doi: 10.1007/s00285-014-0754-2. Epub 2014 Jan 22.
We give a new approach to coding sequence (CDS) density estimation in genomic analysis based on the topological pressure, which we develop from a well-known concept in ergodic theory. Topological pressure measures the 'weighted information content' of a finite word, and incorporates 64 parameters which can be interpreted as a choice of weight for each nucleotide triplet. We train the parameters so that the topological pressure fits the observed coding sequence density on the human genome, and use this to give ab initio predictions of CDS density over windows of size around 66,000 bp on the genomes of Mus musculus, rhesus macaque and Drosophila melanogaster. While the differences between these genomes are too great to expect that training on the human genome could predict, for example, the exact locations of genes, we demonstrate that our method gives reasonable estimates for the 'coarse scale' problem of predicting CDS density. Inspired again by ergodic theory, the weightings of the nucleotide triplets obtained from our training procedure are used to define a probability distribution on finite sequences, which can be used to distinguish between intron and exon sequences from the human genome of lengths between 750 and 5,000 bp. At the end of the paper, we explain the theoretical underpinning for our approach, which is the theory of Thermodynamic Formalism from the dynamical systems literature. Mathematica and MATLAB implementations of our method are available at http://sourceforge.net/projects/topologicalpres/ .
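To make the idea of a 'weighted information content' with one weight per nucleotide triplet concrete, the following is a minimal Python sketch. The abstract does not state the formula, so the form used here is an assumption: the distinct length-n subwords of a word are summed, each weighted by the product of the weights of its overlapping triplets, and the result is put on a log base 4 scale and divided by n. The function name `topological_pressure`, the parameter `n`, and the `weights` dictionary are hypothetical and are not taken from the authors' Mathematica or MATLAB code.

```python
import math
from itertools import product

ALPHABET = "ACGT"


def topological_pressure(word, weights, n):
    """Illustrative weighted subword count on a log_4 scale.

    ASSUMPTION: this formula is a sketch, not the paper's stated definition.
    `weights` maps each of the 64 nucleotide triplets to a positive number.
    """
    if n < 3 or len(word) < n:
        raise ValueError("need n >= 3 and len(word) >= n")
    # Distinct subwords of length n that occur in the word.
    subwords = {word[i:i + n] for i in range(len(word) - n + 1)}
    total = 0.0
    for u in subwords:
        # Multiply the weights of the n - 2 overlapping triplets in u.
        w = 1.0
        for j in range(n - 2):
            w *= weights[u[j:j + 3]]
        total += w
    return math.log(total, 4) / n


# Uniform weights: all 64 triplets get weight 1, so the value reduces to
# log_4(number of distinct length-n subwords) / n.
uniform = {"".join(t): 1.0 for t in product(ALPHABET, repeat=3)}
pressure_uniform = topological_pressure("ACGTACGT", uniform, 3)

# Up-weighting one triplet raises the value for words that contain it.
biased = dict(uniform, ACG=2.0)
pressure_biased = topological_pressure("ACGTACGT", biased, 3)
```

With uniform weights the quantity depends only on how many distinct subwords appear. Non-uniform weights let particular triplets contribute more or less, which is the role the abstract describes for the 64 trained parameters.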