Convergent Island Statistics: a fast method for determining local alignment score significance.

Author information

Poleksic Aleksandar, Danzer Joseph F, Hambly Kevin, Debe Derek A

Affiliation

Eidogen-Sertanty Inc., 9381 Judicial Dr., San Diego, CA 92121, USA.

Publication information

Bioinformatics. 2005 Jun 15;21(12):2827-31. doi: 10.1093/bioinformatics/bti433. Epub 2005 Apr 7.

DOI:10.1093/bioinformatics/bti433

PMID:15817690

Abstract

MOTIVATION

Background distribution statistics for profile-based sequence alignment algorithms cannot be calculated analytically, and hence such algorithms must resort to measuring the significance of an alignment score by assessing its location among a distribution of background alignment scores. The Gumbel parameters that describe this background distribution are usually pre-computed for a limited number of scoring systems, gap schemes, and sequence lengths and compositions. The use of such look-ups is known to introduce errors, which compromise the significance assessment of a remote homology relationship. One solution is to estimate the background distribution for each pair of interest by generating a large number of sequence shuffles and using the distribution of their scores to approximate the parameters of the underlying extreme value distribution. This is computationally very expensive, as a large number of shuffles is needed to estimate the score statistics precisely.
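The shuffle-based baseline described above can be sketched roughly as follows. This is only an illustration of the expensive approach that the abstract contrasts CIS with, not the CIS method itself. The scoring parameters, the plain Smith-Waterman scorer with a linear gap penalty, the method-of-moments fit, and all function names are assumptions made for the example; a real profile-based pipeline would substitute its own scoring function and would likely use a more careful fitting procedure.

```python
import math
import random
import statistics

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant


def local_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best Smith-Waterman local alignment score (linear gap penalty).

    The match/mismatch/gap values are illustrative assumptions.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            s = max(
                0,
                prev[j - 1] + (match if x == y else mismatch),
                prev[j] + gap,
                cur[j - 1] + gap,
            )
            cur.append(s)
            best = max(best, s)
        prev = cur
    return best


def shuffled_scores(a, b, n_shuffles, rng):
    """Score `a` against `n_shuffles` random permutations of `b`."""
    letters = list(b)
    scores = []
    for _ in range(n_shuffles):
        rng.shuffle(letters)  # keeps the composition of b, destroys its order
        scores.append(local_score(a, letters))
    return scores


def fit_gumbel_moments(scores):
    """Method-of-moments estimate of the Gumbel location mu and scale lam.

    Uses mean = mu + gamma / lam and variance = pi^2 / (6 * lam^2).
    """
    sd = statistics.stdev(scores)
    if sd == 0:
        raise ValueError("all background scores are identical")
    lam = math.pi / (sd * math.sqrt(6))
    mu = statistics.mean(scores) - EULER_GAMMA / lam
    return mu, lam


def p_value(score, mu, lam):
    """P(S >= score) under the fitted Gumbel distribution."""
    return 1.0 - math.exp(-math.exp(-lam * (score - mu)))


if __name__ == "__main__":
    rng = random.Random(0)
    a = "MKTAYIAKQRQISFVKSHFSRQ"  # made-up example sequences
    b = "MKLVINGKTLKGEITVEAPKSA"
    mu, lam = fit_gumbel_moments(shuffled_scores(a, b, 500, rng))
    print(mu, lam, p_value(local_score(a, b), mu, lam))
```

Every shuffle costs one full dynamic-programming alignment, so the total cost grows linearly with the number of shuffles. That is the expense the abstract refers to.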

RESULTS

Convergent Island Statistics (CIS) is a computationally efficient solution to the problem of calculating the Gumbel distribution parameters for an arbitrary pair of sequences and an arbitrary set of gap and scoring schemes. The basic idea behind our method is to recognize the lack of similarity for any pair of sequences early in the shuffling process and thus save on the search time. The method is particularly useful in the context of profile-profile alignment algorithms where the normalization of alignment scores has traditionally been a challenging task.

CONTACT

aleksandar@eidogen.com

SUPPLEMENTARY INFORMATION

http://www.eidogen-sertanty.com/Documents/convergent_island_stats_sup.pdf
