Score distributions for simultaneous matching to multiple motifs.

Authors

Bailey T L, Gribskov M

Affiliation

San Diego Supercomputer Center, California 92186-9784, USA.

Publication information

J Comput Biol. 1997 Spring;4(1):45-59. doi: 10.1089/cmb.1997.4.45.

DOI:10.1089/cmb.1997.4.45

PMID:9109037

Abstract

Several computer algorithms now exist for discovering multiple motifs (expressed as weight matrices) that characterize a family of protein sequences known to be homologous. This paper describes a method for performing similarity searches of protein sequence databases using such a group of motifs. By simultaneously using all the motifs that characterize a protein family, the sensitivity and specificity of the database search are increased. We define the p-value for a target sequence to be the probability of a random sequence of the same length scoring as well or better in comparison to all the motifs that characterize the family. (The p-value of a database search can be determined from this value and the size of the database.) We show that estimating the distribution of single motif scores by a Gaussian extreme value distribution is insufficiently accurate to provide a useful estimate of the p-value, but that this deficiency can be corrected by reestimating the parameters of the underlying Gaussian distribution from observed scores for comparison of a given motif and sequence database. These parameters are used to calculate a "reduced variate" which has a Gumbel limiting distribution. Multiple motif scores are combined to give a single p-value by using the sum of the reduced variates for the motif scores as the test statistic. We give a computationally efficient approximation to the distribution of the sum of independent Gumbel random variables and verify experimentally that it closely approximates the distribution of the test statistic. Experiments on pseudorandom sequences show that the approximated p-values are conservative, so the significance of high scores in database searches will not be overstated. Experiments with real protein sequences and motifs identified by the MEME algorithm show that determining an overall p-value based on the combination of multiple motifs gives significantly better database search results than using p-values of single motifs.
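The pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, and it rests on several assumptions: the function names are hypothetical; the normalizing constants are the textbook ones for the maximum of n Gaussian draws, which may differ from the paper's exact form; `mu` and `sigma` stand for Gaussian parameters that have already been re-estimated from observed scores; the tail of the summed reduced variates is estimated by plain Monte Carlo sampling rather than by the paper's efficient approximation; and the database-level p-value assumes independent sequences.

```python
import math
import random


def gumbel_normalizing_constants(n):
    """Textbook constants (a_n, b_n) for the max of n standard normal draws (n >= 2)."""
    root = math.sqrt(2.0 * math.log(n))
    b_n = root - (math.log(math.log(n)) + math.log(4.0 * math.pi)) / (2.0 * root)
    return root, b_n


def reduced_variate(score, mu, sigma, n_positions):
    """Map a best-match motif score to an approximately standard Gumbel variate.

    mu and sigma are assumed to be re-estimated beforehand from observed scores.
    """
    a_n, b_n = gumbel_normalizing_constants(n_positions)
    z = (score - mu) / sigma
    return a_n * (z - b_n)


def gumbel_pvalue(y):
    """Upper tail of the standard Gumbel distribution: 1 - exp(-exp(-y))."""
    return -math.expm1(-math.exp(-y))


def sum_of_gumbels_pvalue(total, k, n_samples=100_000, seed=0):
    """Monte Carlo estimate of P(Y_1 + ... + Y_k >= total) for independent
    standard Gumbel variables (a stand-in for the paper's approximation)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_samples):
        s = 0.0
        for _ in range(k):
            u = max(rng.random(), 1e-300)   # avoid log(0)
            s += -math.log(-math.log(u))    # inverse-CDF sampling
        if s >= total:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_samples + 1)


def database_pvalue(p_sequence, n_sequences):
    """Chance that at least one of n independent sequences reaches p_sequence
    (requires 0 <= p_sequence < 1)."""
    return -math.expm1(n_sequences * math.log1p(-p_sequence))


if __name__ == "__main__":
    # Hypothetical scores for three motifs against one 300-position sequence.
    scores = [(21.0, 2.0, 5.0), (18.5, 1.5, 4.8), (25.0, 3.0, 5.5)]  # (score, mu, sigma)
    ys = [reduced_variate(s, mu, sd, 300) for s, mu, sd in scores]
    p_seq = sum_of_gumbels_pvalue(sum(ys), k=len(ys))
    print("reduced variates:", [round(y, 2) for y in ys])
    print("sequence p-value:", p_seq)
    print("database p-value:", database_pvalue(p_seq, 50_000))
```

The Monte Carlo estimate cannot resolve p-values much smaller than 1/`n_samples`, which is one reason an analytic approximation of the kind the abstract mentions is preferable for large database searches.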

