All hits all the time: parameter-free calculation of spaced seed sensitivity.

Authors

Mak Denise Y F, Benson Gary

Affiliation

Graduate Program in Bioinformatics, Boston University, Boston, MA 02215, USA.

Publication

Bioinformatics. 2009 Feb 1;25(3):302-8. doi: 10.1093/bioinformatics/btn643. Epub 2008 Dec 18.

Abstract

MOTIVATION

Standard search techniques for DNA repeats start by identifying small matching words, or seeds, that may inhabit larger repeats. Recent innovations in seed structure include spaced seeds and indel seeds which are more sensitive than contiguous seeds. Evaluating seed sensitivity requires (i) specifying a homology model for alignments and (ii) assigning probabilities to those alignments. Optimal seed selection is resource intensive because all alternative seeds must be tested. Current methods require that the model and its probability parameters be specified in advance. When the parameters change, the entire calculation has to be rerun.

RESULTS

We show how to eliminate the need for prior parameter specification by exploiting a simple observation: given a homology model, the alignments hit by a particular seed remain the same regardless of the probability parameters. Only the weights assigned to those alignments change. Therefore, if we know all the hits, we can easily (and quickly) find optimal seeds. We describe an efficient preprocessing step, which is computed once per seed. Then we show several increasingly efficient methods to find the optimal seed when given specific probability parameters. Indeed, we show how to determine exactly which seeds can never be optimal under any set of probability parameters. This leads to the startling observation that out of thousands of seeds, only a handful have any chance of being optimal. We then show how to identify optimal seeds and the boundaries within probability space where they are optimal.
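The central observation can be illustrated with a small sketch. It is not the authors' algorithm: it assumes a simple Bernoulli homology model (each alignment position is a match with probability p, independently), writes seeds as strings of '1' (must match) and '*' (don't care), and enumerates all alignments by brute force, which is only feasible for short alignment lengths. The function names (seed_hits, hit_counts, sensitivity, dominates) are hypothetical. The point is that hit_counts never sees p, so it is computed once per seed, while sensitivity reweights the stored counts for any p.

```python
from itertools import product


def seed_hits(seed, alignment):
    """Return True if the seed matches the alignment at some offset.

    alignment is a tuple of 0/1 values (1 = match, 0 = mismatch).
    """
    required = [i for i, c in enumerate(seed) if c == "1"]
    last_start = len(alignment) - len(seed)
    return any(
        all(alignment[start + i] == 1 for i in required)
        for start in range(last_start + 1)
    )


def hit_counts(seed, length):
    """Parameter-free preprocessing (brute force, exponential in length).

    counts[k] = number of alignments with exactly k matches that the seed hits.
    """
    counts = [0] * (length + 1)
    for alignment in product((0, 1), repeat=length):
        if seed_hits(seed, alignment):
            counts[sum(alignment)] += 1
    return counts


def sensitivity(counts, p):
    """Weight the stored hits under the assumed Bernoulli model.

    An alignment with k matches out of n has probability p^k * (1-p)^(n-k).
    """
    n = len(counts) - 1
    return sum(c * p**k * (1 - p) ** (n - k) for k, c in enumerate(counts))


def dominates(counts_a, counts_b):
    """Sufficient check: seed A is at least as sensitive as seed B for every p.

    Holds because every term p^k * (1-p)^(n-k) is non-negative on [0, 1].
    """
    return all(a >= b for a, b in zip(counts_a, counts_b))


if __name__ == "__main__":
    length = 12  # illustrative alignment length, kept small for brute force
    seeds = ["1111", "11*11", "1*1*11"]  # illustrative seeds of weight 4
    tables = {s: hit_counts(s, length) for s in seeds}  # done once per seed
    for p in (0.5, 0.7, 0.9):  # parameters can change without re-enumerating
        best = max(seeds, key=lambda s: sensitivity(tables[s], p))
        print(p, best, round(sensitivity(tables[best], p), 4))
```

The dominates check is only a sufficient condition under this simplified model: a seed whose counts are nowhere larger than another seed's can never beat it, whatever p is. The abstract's exact characterization of never-optimal seeds is stronger and is not reproduced here.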
