Mathura Venkatarajan S, Schein Catherine H, Braun Werner
Sealy Center for Structural Biology, Department of Human Biological Chemistry and Genetics, University of Texas Medical Branch, Galveston, TX 77555-1157, USA.
Bioinformatics. 2003 Jul 22;19(11):1381-90. doi: 10.1093/bioinformatics/btg164.
Identification of short conserved sequence motifs common to a protein family or superfamily can be more useful than overall sequence similarity in suggesting the function of novel gene products. Locating motifs still requires expert knowledge, as automated methods using stringent criteria may not differentiate subtle similarities from statistical noise.
We have developed a novel automatic method, based on patterns of conservation of 237 physical-chemical properties of amino acids in aligned protein sequences, to find related motifs in proteins with little or no overall sequence similarity. As an application, our web server MASIA identified 12 property-based motifs in the apurinic/apyrimidinic endonuclease (APE) family of DNA-repair enzymes of the DNase-I superfamily. Searching the ASTRAL40 database with these motifs, using a Bayesian scoring function, located distantly related representatives of the DNase-I superfamily, such as inositol 5'-polyphosphate phosphatases. Other proteins containing APE motifs had no overall sequence or structural similarity to the APE family. However, all were phosphatases and/or had a metal ion binding active site. Thus our automated method can identify discrete elements in distantly related proteins that define local structure and aspects of function. We anticipate that our method will complement existing ones in functionally annotating novel protein sequences from genome projects.
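The idea of property-based conservation described above can be illustrated with a minimal sketch. It is not the MASIA algorithm: it assumes a single property (the Kyte-Doolittle hydropathy scale) in place of the 237 properties used in the paper, measures conservation as the population standard deviation of that property in each alignment column, and reports runs of consecutive low-spread columns as candidate motifs. The function names, the thresholds `max_std` and `min_len`, and the toy alignment are all hypothetical choices made for illustration.

```python
from statistics import pstdev

# Kyte-Doolittle hydropathy values, used here as a stand-in for one
# physical-chemical property (an assumption; the paper uses 237 properties).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def column_spread(alignment, scale=KYTE_DOOLITTLE):
    """Return the per-column standard deviation of the property.

    Gaps and unknown residues are skipped. A column with fewer than two
    scored residues yields None. Rows are assumed to have equal length;
    zip() silently truncates to the shortest row otherwise.
    """
    spreads = []
    for column in zip(*alignment):
        values = [scale[aa] for aa in column if aa in scale]
        spreads.append(pstdev(values) if len(values) >= 2 else None)
    return spreads


def find_motifs(alignment, max_std=1.0, min_len=3, scale=KYTE_DOOLITTLE):
    """Return half-open (start, end) column ranges where the property is
    conserved (spread <= max_std) for at least min_len consecutive columns."""
    motifs = []
    start = None
    # A trailing None acts as a sentinel that closes any run still open.
    for i, spread in enumerate(column_spread(alignment, scale) + [None]):
        conserved = spread is not None and spread <= max_std
        if conserved and start is None:
            start = i
        elif not conserved and start is not None:
            if i - start >= min_len:
                motifs.append((start, i))
            start = None
    return motifs


# Hypothetical toy alignment: columns 1-3 vary in residue identity
# (I/L/V) but are uniformly hydrophobic.
toy_alignment = [
    "AILVKDG",
    "RLIVWDS",
    "GVLIDET",
    "KIVLFEP",
]

if __name__ == "__main__":
    print(find_motifs(toy_alignment))
```

In the toy alignment, columns 1 to 3 share no single residue across all sequences, yet the hydropathy of each column is nearly constant. This is the kind of signal that identity-based conservation measures can miss and that a property-based measure picks up.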
MASIA web site: http://www.scsb.utmb.edu/masia/masia.html
The dendrogram of the 42 APE sequences used to derive the motifs is available at http://www.scsb.utmb.edu/comp_biol.html/DNA_repair/publication.html