Identifying uniformly mutated segments within repeats.

Author information

Sahinalp S Cenk, Eichler Evan, Goldberg Paul, Berenbrink Petra, Friedetzky Tom, Ergun Funda

Affiliation

School of Computing Science, Simon Fraser University, Canada.

Publication information

J Bioinform Comput Biol. 2004 Dec;2(4):657-68. doi: 10.1142/s0219720004000788.

Abstract

Given a long string of characters from a constant-size alphabet, we present an algorithm to determine whether its characters have been generated by a single i.i.d. random source. More specifically, consider all possible n-coin models for generating a binary string S, where each bit of S is generated via an independent toss of one of the n coins in the model. The choice of which coin to toss is decided by a random walk on the set of coins, where the probability of a coin change is much lower than the probability of using the same coin repeatedly. We present a procedure to evaluate the likelihood of an n-coin model for a given S, subject to a uniform prior distribution over the parameters of the model (which represent mutation rates and probabilities of copying events). In the absence of detailed prior knowledge of these parameters, the algorithm can be used to determine whether the a posteriori probability for n=1 is higher than for any n>1. Our algorithm runs in time O(l^4 log l), where l is the length of S, through a dynamic programming approach that exploits the assumed convexity of the a posteriori probability for n. Our test can be used in the analysis of long alignments between pairs of genomic sequences in a number of ways. For example, functional regions in genome sequences exhibit much lower mutation rates than non-functional regions. Because our test provides a means of detecting variations in the mutation rate, it may be used to distinguish functional regions from non-functional ones. Another application is in determining whether two highly similar, and thus evolutionarily related, genome segments are the result of a single copy event or of a complex series of copy events. This is particularly an issue in evolutionary studies of genome regions rich in repeat segments (especially tandemly repeated segments).
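To make the n-coin model concrete, the following Python sketch generates a binary string from such a model and scores it under a uniform prior on each coin's bias. It is an illustration only, not the paper's O(l^4 log l) dynamic program: the function names, the rule of jumping to a uniformly chosen different coin, and the comparison of one coin against a single best change point (ignoring any prior on the switching probability) are all assumptions made here. The one fact it relies on is that integrating p^k (1-p)^(m-k) over p in [0, 1] gives k!(m-k)!/(m+1)!.

```python
import math
import random


def sample_n_coin(biases, switch_prob, length, seed=0):
    """Generate bits from a hypothetical n-coin model (illustrative only).

    Each bit is one toss of the current coin. After each toss the walk moves
    to a different coin with small probability `switch_prob` (assumed rule).
    """
    rng = random.Random(seed)
    coin = rng.randrange(len(biases))
    bits = []
    for _ in range(length):
        bits.append(1 if rng.random() < biases[coin] else 0)
        if len(biases) > 1 and rng.random() < switch_prob:
            coin = rng.choice([c for c in range(len(biases)) if c != coin])
    return bits


def log_evidence(ones, total):
    """log of the integral of p^ones (1-p)^(total-ones) over Uniform(0, 1)."""
    return (math.lgamma(ones + 1) + math.lgamma(total - ones + 1)
            - math.lgamma(total + 2))


def log_marginal_one_coin(bits):
    """Log marginal likelihood of the whole string under a single coin."""
    return log_evidence(sum(bits), len(bits))


def best_two_segment_split(bits):
    """Best log evidence for two contiguous segments with independent coins.

    A simplified stand-in for the n=2 case: it scans every change point and
    ignores the prior on the switching probability.
    """
    total_ones, m = sum(bits), len(bits)
    best_score, best_cut = -math.inf, None
    ones_left = 0
    for cut in range(1, m):
        ones_left += bits[cut - 1]
        score = (log_evidence(ones_left, cut)
                 + log_evidence(total_ones - ones_left, m - cut))
        if score > best_score:
            best_score, best_cut = score, cut
    return best_score, best_cut
```

In an alignment setting, a 1 could stand for a mismatched column and a 0 for a matched one, so a string whose best two-segment score clearly exceeds its one-coin score would hint at a change in mutation rate along the alignment. A proper comparison would also have to account for the extra parameters of the larger model, which this sketch does not do.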

