Toward an accurate statistics of gapped alignments.

Author information

Kschischo Maik, Lässig Michael, Yu Yi-Kuo

Affiliation

University of Applied Sciences Koblenz, RheinAhrCampus Remagen, Südallee 2, 53424 Remagen, Germany.

Publication information

Bull Math Biol. 2005 Jan;67(1):169-91. doi: 10.1016/j.bulm.2004.07.001.

Abstract

Sequence alignment has been an invaluable tool for finding homologous sequences. The significance of the homology found is often quantified statistically by p-values. Theory for computing p-values exists for gapless alignments [Karlin, S., Altschul, S.F., 1990. Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes. Proc. Natl. Acad. Sci. USA 87, 2264-2268; Karlin, S., Dembo, A., 1992. Limit distributions of maximal segmental score among Markov-dependent partial sums. Adv. Appl. Probab. 24, 113-140], but a full generalization to alignments with gaps is not yet complete. We present a unified statistical analysis of two common sequence comparison algorithms: maximum-score (Smith-Waterman) alignments and their generalized probabilistic counterparts, including maximum-likelihood alignments and hidden Markov models. The most important statistical characteristic of these algorithms is the distribution function, for mutually uncorrelated random sequences, of the maximum score S_max or, for probabilistic alignment, of the maximum free energy F_max. This distribution is known empirically to be of the Gumbel form with an exponential tail, P(S_max > x) ≈ exp(-λx) for maximum-score alignment and P(F_max > x) ≈ exp(-λx) for some classes of probabilistic alignment. We derive an exact expression for λ for particular probabilistic alignments. This result is then used to obtain accurate λ values for generic probabilistic and maximum-score alignments. Although the result is demonstrated for a simple match-mismatch scoring system, it is expected to be a good starting point for more general scoring functions.
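
To make the role of λ concrete, the following minimal Python sketch computes the decay constant for the gapless case only, using the standard Karlin-Altschul condition that λ is the positive root of p·exp(λ·match) + (1 - p)·exp(λ·mismatch) = 1 for a match-mismatch scoring system. It is not the gapped result derived in the paper. The function name `gapless_lambda` and its parameters are hypothetical, and `p_match` (the probability that two random letters match) is an assumed input.

```python
import math


def gapless_lambda(match, mismatch, p_match, tol=1e-12):
    """Positive root of p*exp(lam*match) + (1-p)*exp(lam*mismatch) = 1.

    Gapless (Karlin-Altschul) case only; an illustrative sketch,
    not the gapped-alignment formula from the paper.
    """
    expected = p_match * match + (1.0 - p_match) * mismatch
    if match <= 0 or expected >= 0:
        # A positive root exists only if some score is positive
        # and the expected score per aligned pair is negative.
        raise ValueError("need match > 0 and negative expected score")

    def f(lam):
        return (p_match * math.exp(lam * match)
                + (1.0 - p_match) * math.exp(lam * mismatch) - 1.0)

    # f(0) = 0 and f is convex, so f < 0 on (0, root) and f > 0 beyond it.
    hi = 1.0
    while f(hi) <= 0:
        hi *= 2.0          # grow until we are past the root
    lo = hi
    while f(lo) >= 0:
        lo /= 2.0          # shrink until we are strictly below the root

    # Plain bisection on the bracket [lo, hi].
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if f(mid) < 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


# Uniform 4-letter alphabet (p_match = 1/4), match = +1, mismatch = -1.
lam = gapless_lambda(1, -1, 0.25)
```

For this example the condition reduces to x² - 4x + 3 = 0 with x = exp(λ), so the positive root is λ = ln 3. Gaps are expected to change this value, which is the quantity the paper addresses.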
