Accurate statistics for local sequence alignment with position-dependent scoring by rare-event sampling.

Affiliation

Laboratoire MAP5 (UMR CNRS 8145), Université Paris Descartes, Paris, France.

Publication information

BMC Bioinformatics. 2011 Feb 3;12:47. doi: 10.1186/1471-2105-12-47.

DOI:10.1186/1471-2105-12-47

PMID:21291566

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC3042914/

Abstract

BACKGROUND

Molecular database search tools need statistical models to assess the significance of the resulting hits. In the classical approach, one asks how probable it is that a certain score is observed by pure chance. Asymptotic theories for such questions are available for two random i.i.d. sequences. Some effort has been made to include the effects of finite sequence lengths and to account for specific compositions of the sequences. In many applications, such as a large-scale database homology search for transmembrane proteins, these models are not the most appropriate ones. Search sensitivity and specificity benefit from position-dependent scoring schemes or the use of Hidden Markov Models. Additionally, one may wish to go beyond the assumption that the sequences are i.i.d. Despite their practical importance, the statistical properties of these settings have not yet been well investigated.
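For orientation, the best-known asymptotic result for two random i.i.d. sequences is the Karlin-Altschul extreme-value form for gapless local alignment. The abstract does not name it, so reading "asymptotic theories" this way is an assumption; the symbols below (sequence lengths m and n, letter frequencies p_a, scoring function σ, constants K and λ) are the conventional ones and are not taken from the abstract.

```latex
% Tail of the optimal gapless local alignment score S for two i.i.d. sequences
% of lengths m and n (valid asymptotically, for large m and n):
P(S \ge s) \;\approx\; 1 - \exp\!\left(-K\, m\, n\, e^{-\lambda s}\right)

% \lambda is the unique positive solution of
\sum_{a,b} p_a\, p_b\, e^{\lambda\, \sigma(a,b)} = 1 ,
% which exists when the expected score per aligned pair is negative
% and at least one pair of letters has a positive score.
```

For gapped alignment the same functional form is commonly used with fitted parameters, which is one reason direct sampling of the tail is of interest.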

RESULTS

In this paper, we discuss an efficient and general method to compute the score distribution to any desired accuracy. The general approach may be applied to different sequence models and to various similarity measures that satisfy a few weak assumptions. The method gives access to the low-probability region ("tail") of the distribution, where scores are larger than expected by pure chance and are therefore relevant for practical applications. Our method uses recent ideas from rare-event simulations, combining Markov chain Monte Carlo simulations with importance sampling and generalized ensembles. We present results for the score statistics of fixed and random queries against random sequences. In a second step, we extend the approach to a model of transmembrane proteins, which can hardly be described as i.i.d. sequences. For this case, we compare the statistical properties of a fixed-query model and of a hidden Markov sequence model, in connection with a position-based scoring scheme, with those of the classical approach.
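The combination of Markov chain Monte Carlo with importance sampling described above can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a toy four-letter alphabet, a uniform i.i.d. background, a hypothetical match/mismatch profile standing in for a position-dependent scoring scheme, and a single tilted ensemble with weight exp(S/T) instead of the generalized ensembles used in the paper. All function names and parameters are illustrative.

```python
import math
import random

ALPHABET = "ACGT"  # toy alphabet (assumption; proteins would use 20 letters)

def toy_profile(query, match=1, mismatch=-1):
    """Hypothetical position-dependent scores: one letter->score row per query position."""
    return [{a: (match if a == q else mismatch) for a in ALPHABET} for q in query]

def local_score(profile, seq, gap=2):
    """Smith-Waterman local alignment score with a linear gap cost."""
    prev = [0] * (len(seq) + 1)
    best = 0
    for row in profile:
        cur = [0] * (len(seq) + 1)
        for j, letter in enumerate(seq, start=1):
            cur[j] = max(0, prev[j - 1] + row[letter], prev[j] - gap, cur[j - 1] - gap)
            best = max(best, cur[j])
        prev = cur
    return best

def sample_tilted(profile, length, temperature, steps, rng):
    """Metropolis chain over subject sequences, targeting P0(x) * exp(S(x)/T).

    P0 is uniform i.i.d., so the single-letter proposal is symmetric and the
    acceptance probability is min(1, exp((S_new - S_old) / T)).
    """
    seq = [rng.choice(ALPHABET) for _ in range(length)]
    score = local_score(profile, seq)
    scores = []
    for _ in range(steps):
        pos = rng.randrange(length)
        old = seq[pos]
        seq[pos] = rng.choice(ALPHABET)
        new_score = local_score(profile, seq)
        if rng.random() < math.exp(min(0.0, (new_score - score) / temperature)):
            score = new_score      # accept the move
        else:
            seq[pos] = old         # reject and restore the letter
        scores.append(score)
    return scores

def tail_probability(scores, temperature, threshold):
    """Self-normalized importance-sampling estimate of P0(S >= threshold)."""
    shift = min(scores)  # constant factor cancels in the ratio; avoids underflow
    weights = [math.exp(-(s - shift) / temperature) for s in scores]
    hit = sum(w for s, w in zip(scores, weights) if s >= threshold)
    return hit / sum(weights)

if __name__ == "__main__":
    rng = random.Random(1)
    profile = toy_profile("ACGTACGTAC")
    scores = sample_tilted(profile, length=40, temperature=1.0, steps=5000, rng=rng)
    for s in range(4, 11):
        print(s, tail_probability(scores, 1.0, s))
```

The tilt pushes the chain toward high-scoring sequences, and the weights exp(-S/T) undo that bias when estimating probabilities under the background model. A single temperature only covers part of the score range reliably; the chain also needs equilibration and is autocorrelated, which this sketch ignores.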

CONCLUSIONS

The results illustrate that the sensitivity and specificity strongly depend on the underlying scoring and sequence model. A specific ROC analysis for the case of transmembrane proteins supports our observation.
