Miklós István
Department of Statistics, University of Oxford, Oxford, UK.
Bioinformatics. 2003 Oct;19 Suppl 2:ii130-7. doi: 10.1093/bioinformatics/btg1070.
As more and more genomes have been sequenced, genomic data is rapidly accumulating. Genome-wide mutations are believed more neutral than local mutations such as substitutions, insertions and deletions, therefore phylogenetic investigations based on inversions, transpositions and inverted transpositions are less biased by the hypothesis on neutral evolution. Although efficient algorithms exist for obtaining the inversion distance of two signed permutations, there is no reliable algorithm when both inversions and transpositions are considered. Moreover, different type of mutations happen with different rates, and it is not clear how to weight them in a distance based approach.
We introduce a Markov Chain Monte Carlo method to genome rearrangement based on a stochastic model of evolution, which can estimate the number of different evolutionary events needed to sort a signed permutation. The performance of the method was tested on simulated data, and the estimated numbers of different types of mutations were reliable. Human and Drosophila mitochondrial data were also analysed with the new method. The mixing time of the Markov Chain is short both in terms of CPU times and number of proposals.
The source code in C is available on request from the author.
随着越来越多的基因组被测序,基因组数据正在迅速积累。全基因组突变被认为比诸如替换、插入和缺失等局部突变更具中性,因此基于倒位、转座和反向转座的系统发育研究受中性进化假设的影响较小。尽管存在用于获取两个有符号排列的倒位距离的有效算法,但在同时考虑倒位和转座时,却没有可靠的算法。此外,不同类型的突变以不同的速率发生,并且不清楚如何在基于距离的方法中对它们进行加权。
我们基于一种随机进化模型,引入了一种用于基因组重排的马尔可夫链蒙特卡罗方法,该方法可以估计将一个有符号排列排序所需的不同进化事件的数量。该方法的性能在模拟数据上进行了测试,并且估计的不同类型突变的数量是可靠的。还使用这种新方法分析了人类和果蝇的线粒体数据。马尔可夫链的混合时间在CPU时间和提议数量方面都很短。
可根据作者要求提供C语言的源代码。