Bailey T L, Baker M E, Elkan C P
Department of Computer Science and Engineering, University of California, San Diego, La Jolla 92093, U.S.A.
J Steroid Biochem Mol Biol. 1997 May;62(1):29-44. doi: 10.1016/s0960-0760(97)00013-7.
MEME (Multiple Expectation-maximization for Motif Elicitation) is a unique new software tool that uses artificial intelligence techniques to discover motifs shared by a set of protein sequences in a fully automated manner. This paper is the first detailed study of the use of MEME to analyse a large, biologically relevant set of sequences, and to evaluate the sensitivity and accuracy of MEME in identifying structurally important motifs. For this purpose, we chose the short-chain alcohol dehydrogenase superfamily because it is large and phylogenetically diverse, providing a test of how well MEME can work on sequences with low amino acid similarity. Moreover, this dataset contains enzymes of biological importance, and because several enzymes have known X-ray crystallographic structures, we can test the usefulness of MEME for structural analysis. The first six motifs from MEME map onto structurally important alpha-helices and beta-strands on Streptomyces hydrogenans 20beta-hydroxysteroid dehydrogenase. We also describe MAST (Motif Alignment Search Tool), which conveniently uses output from MEME for searching databases such as SWISS-PROT and Genpept. MAST provides statistical measures that permit a rigorous evaluation of the significance of database searches with individual motifs or groups of motifs. A database search of Genpept90 by MAST with the log-odds matrix of the first six motifs obtained from MEME yields a bimodal output, demonstrating the selectivity of MAST. We show for the first time, using primary sequence analysis, that bacterial sugar epimerases are homologs of short-chain dehydrogenases. MEME and MAST will be increasingly useful as genome sequencing provides large datasets of phylogenetically divergent sequences of biomedical interest.
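To illustrate what searching a sequence with a motif's log-odds matrix means, here is a minimal Python sketch. It is not MAST's scoring or its statistical evaluation, and it is not taken from the paper. The toy motif instances, the uniform background frequencies, the pseudocount of 1, and the function names log_odds_matrix and best_match are all assumptions made for illustration.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def log_odds_matrix(instances, pseudocount=1.0):
    """Build a position-specific log-odds matrix from equal-length motif instances.

    Assumptions (not from the paper): a uniform background over the 20
    standard amino acids and a flat pseudocount added to every residue.
    """
    width = len(instances[0])
    background = 1.0 / len(AMINO_ACIDS)
    matrix = []
    for pos in range(width):
        # Start every residue at the pseudocount so no frequency is zero.
        counts = {aa: pseudocount for aa in AMINO_ACIDS}
        for inst in instances:
            counts[inst[pos]] += 1.0
        total = sum(counts.values())
        # Log-odds: observed frequency relative to the background, in bits.
        matrix.append(
            {aa: math.log2((counts[aa] / total) / background) for aa in AMINO_ACIDS}
        )
    return matrix


def best_match(matrix, sequence):
    """Return (score, start) of the highest-scoring window in the sequence.

    Only the 20 standard residues are supported in this sketch.
    """
    width = len(matrix)
    best_score, best_start = float("-inf"), -1
    for start in range(len(sequence) - width + 1):
        # Sum the per-position log-odds for the residues in this window.
        score = sum(matrix[i][sequence[start + i]] for i in range(width))
        if score > best_score:
            best_score, best_start = score, start
    return best_score, best_start


# Hypothetical toy data, not sequences from the study.
toy_instances = ["GASGIG", "GGSGIG", "GASGLG"]
toy_matrix = log_odds_matrix(toy_instances)
score, start = best_match(toy_matrix, "MKLTGASGIGKAT")
print(f"best window starts at index {start} with score {score:.2f} bits")
```

A real search would score every sequence in a database this way and then assess how likely each score is to arise by chance, which is the part the abstract attributes to MAST's statistical measures and which this sketch omits.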