UCD Complex and Adaptive Systems Laboratory, University College Dublin, Dublin, Ireland.
BMC Bioinformatics. 2010 Jan 7;11:14. doi: 10.1186/1471-2105-11-14.
Large datasets of protein interactions provide a rich resource for the discovery of Short Linear Motifs (SLiMs) that recur in unrelated proteins. However, existing methods for estimating the probability of motif recurrence may be biased by the size and composition of the search dataset, such that p-value estimates from different datasets, or from motifs containing different numbers of non-wildcard positions, are not strictly comparable. Here, we develop more exact methods and explore the potential biases of computationally efficient approximations.
A widely used heuristic for calculating motif over-representation approximates motif probability by assuming that all proteins have the same length and composition. We introduce pv, which calculates the probability exactly. Second, the recently introduced SLiMFinder statistic Sig accounts for multiple testing (across all possible motifs) in motif discovery. However, it approximates the probability of every other possible motif occurring with a score of p or less as being equal to p. Here, we show that the exhaustive calculation of the probability of all possible motif occurrences that are as rare as or rarer than the motif of interest, Sig', may be carried out efficiently by grouping motifs of a common probability (i.e. those which have permuted orders of the same residues). Sig'v, which corrects both approximations, is shown to be uniformly distributed in a random dataset when searching for non-ambiguous motifs, indicating that it is a robust significance measure.
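The grouping idea can be illustrated with a minimal Python sketch. It is not the SLiMFinder implementation and does not compute Sig' itself. It assumes a simple independence model in which the per-site probability of a non-ambiguous motif is the product of its residue frequencies, so every permutation of the same residues shares one probability. The function names (`grouped_motif_probs`, `count_as_rare`) and the three-letter toy alphabet are hypothetical and chosen only for illustration.

```python
from collections import Counter
from fractions import Fraction
from itertools import combinations_with_replacement
from math import factorial, prod


def grouped_motif_probs(freqs, k):
    """Yield (probability, multiplicity) for each residue multiset of size k.

    Assumption: a motif's per-site probability is the product of its
    residue frequencies, so all orderings of a multiset are equiprobable.
    """
    for combo in combinations_with_replacement(sorted(freqs), k):
        p = prod(freqs[r] for r in combo)
        # Number of distinct orderings of this multiset: k! / prod(count!)
        multiplicity = factorial(k)
        for count in Counter(combo).values():
            multiplicity //= factorial(count)
        yield p, multiplicity


def count_as_rare(freqs, k, p_ref):
    """Count length-k motifs whose probability is p_ref or less,
    visiting one representative per multiset instead of every ordering."""
    return sum(m for p, m in grouped_motif_probs(freqs, k) if p <= p_ref)


# Toy alphabet with exact rational frequencies (hypothetical values).
toy_freqs = {"A": Fraction(1, 2), "C": Fraction(3, 10), "D": Fraction(1, 5)}

# Probability of the motif "CA" under the independence assumption.
p_ca = toy_freqs["C"] * toy_freqs["A"]
n_rare = count_as_rare(toy_freqs, 2, p_ca)
```

With 20 amino acids and k defined positions, this visits C(20 + k - 1, k) multisets rather than 20^k ordered motifs (8,855 rather than 160,000 for k = 4), which is what makes an exhaustive sum over all motifs tractable. Exact fractions are used here only so that the comparison `p <= p_ref` is not affected by floating-point rounding in the toy example.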
A method is presented to compute exactly the true probability of a non-ambiguous short protein sequence motif, and the utility of an approximate approach for novel motif discovery across a large number of datasets is demonstrated.