Department of Computational Biology, Faculty of Frontier Science, The University of Tokyo, Kashiwa, Chiba 277-8561, Japan.
Bioinformatics. 2011 Sep 1;27(17):2346-53. doi: 10.1093/bioinformatics/btr420. Epub 2011 Jul 14.
Measuring evolutionary conservation is a routine step in the identification of functional elements in genome sequences. Although a number of studies have proposed methods that use continuous-time Markov models (CTMMs) to find evolutionarily constrained elements, the probabilistic structure of these models has been investigated less often.
In this article, we investigate a sufficient statistic for CTMMs. The statistic consists of the fractional duration of nucleotide characters over evolutionary time, F(d), and the number of substitutions occurring in phylogenetic trees, N(s). We first derive basic properties of the sufficient statistic. We then derive an expectation maximization (EM) algorithm for estimating the parameters of a phylogenetic model, which iteratively computes the expected values of the sufficient statistic. We show that the EM algorithm converges much faster than other optimization methods that use numerical gradient descent algorithms. Finally, we investigate the genome-wide distribution of the fractional duration F(d), which, unlike the number of substitutions N(s), has rarely been investigated. We show that F(d) carries evolutionary information distinct from that in N(s), which may be useful for detecting novel types of evolutionary constraints in the human genome.
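To make the role of the sufficient statistic concrete, the following is a minimal sketch and not the Fdur implementation. It makes several simplifying assumptions that do not come from the article: a Jukes-Cantor model with a single rate `mu`, a single branch of fixed length `t` joining two aligned sequences (rather than a full phylogenetic tree), and Simpson's rule for the integrals. All function names are hypothetical. The E-step computes the expected number of substitutions conditioned on the observed endpoints; the M-step divides it by the total evolutionary time. The expected dwell times, normalized by `t`, play the role of a fractional duration.

```python
import math
from collections import Counter

STATES = "ACGT"

def jc_prob(a, b, mu, s):
    """Jukes-Cantor probability of state b at time s given state a at time 0.
    Assumption: each state is left at total rate mu (mu/3 per target state)."""
    decay = math.exp(-4.0 * mu * s / 3.0)
    return 0.25 + 0.75 * decay if a == b else 0.25 - 0.25 * decay

def simpson(f, lo, hi, n=100):
    """Composite Simpson's rule with n (even) subintervals."""
    h = (hi - lo) / n
    total = f(lo) + f(hi)
    for k in range(1, n):
        total += (4 if k % 2 else 2) * f(lo + k * h)
    return total * h / 3.0

def expected_dwell(a, b, mu, t):
    """Expected time spent in each state on a branch of length t,
    conditioned on starting in a and ending in b."""
    p_ab = jc_prob(a, b, mu, t)
    return {
        i: simpson(lambda s: jc_prob(a, i, mu, s) * jc_prob(i, b, mu, t - s), 0.0, t) / p_ab
        for i in STATES
    }

def expected_subs(a, b, mu, t):
    """Expected number of substitutions on the branch, conditioned on endpoints a and b."""
    p_ab = jc_prob(a, b, mu, t)
    total = 0.0
    for i in STATES:
        for j in STATES:
            if i != j:
                total += simpson(
                    lambda s: jc_prob(a, i, mu, s) * jc_prob(j, b, mu, t - s), 0.0, t
                )
    return (mu / 3.0) * total / p_ab

def em_rate(seq_x, seq_y, t=1.0, mu=0.1, iters=40):
    """EM estimate of mu from two aligned, gap-free sequences (hypothetical helper)."""
    pair_counts = Counter(zip(seq_x, seq_y))
    n_sites = sum(pair_counts.values())
    for _ in range(iters):
        # E-step: expected substitution count under the current rate.
        e_subs = sum(c * expected_subs(a, b, mu, t) for (a, b), c in pair_counts.items())
        # M-step: substitutions per unit of total evolutionary time.
        mu = e_subs / (n_sites * t)
    return mu

if __name__ == "__main__":
    x = "ACGTACGTACGTACGTACGT"
    y = "ACGGACGAACGTCCGTACTT"
    mu_hat = em_rate(x, y)
    print("estimated rate:", mu_hat)
    dwell = expected_dwell("T", "G", mu_hat, 1.0)
    print("fractional durations for T -> G:", {k: v / 1.0 for k, v in dwell.items()})
```

Under this one-parameter model the total dwell time is fixed by the branch length, so only the substitution count enters the M-step. In richer models with state-specific rates the per-state dwell times would enter the update as well. For this two-sequence case the EM fixed point coincides with the closed-form Jukes-Cantor distance, which gives a simple way to check the sketch.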
The C++ source code of the 'Fdur' software is available at http://www.ncrna.org/software/fdur/
Supplementary data are available at Bioinformatics online.