Shemyakin-Ovchinnikov Institute of Bioorganic Chemistry, Russian Academy of Sciences, ul. Miklukho-Maklaya 16/10, 117997 Moscow, Russia.
BMC Genomics. 2011 Jan 31;12:88. doi: 10.1186/1471-2164-12-88.
Novel high-throughput sequencing technologies require continual development of bioinformatics data processing methods. Among these, rapid and reliable identification of encoded proteins plays a pivotal role. To search for particular protein families, amino acid sequence motifs suitable for selective screening of nucleotide sequence databases may be used. In this work, we propose a novel method for simplified representation of protein amino acid sequences, named Single Residue Distribution Analysis, which is applicable to both homology search and database screening.
Using the developed procedure, a search for amino acid sequence motifs in sea anemone polypeptides was performed, and 14 distinct motifs of broad and low specificity were identified. The adequacy of the motifs for mining toxin-like sequences was confirmed by their ability to identify 100% of the toxin-like anemone polypeptides in the reference polypeptide database. Applying the novel motifs to the search for polypeptide toxins in the Anemonia viridis EST dataset allowed us to identify 89 putative toxin precursors. The translated and modified ESTs were scanned using a dedicated algorithm. In addition to direct comparison with the developed motifs, putative signal peptides were predicted and homology with known structures was examined.
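The screening step described above (translating ESTs and comparing them with amino acid motifs) can be illustrated with a minimal Python sketch. It is not the authors' algorithm: the six-frame translation, the function names, and the cysteine-spacing pattern `C.{3}C.{2}C` are assumptions made here for illustration, and the pattern is not one of the 14 motifs reported in the paper. Signal peptide prediction and homology checks are left out.

```python
import re

# Standard genetic code, built from the conventional TCAG codon ordering.
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate DNA codon by codon; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X")
        for i in range(0, len(dna) - 2, 3)
    )


def six_frame(dna):
    """Yield (frame_label, protein) for all six reading frames."""
    dna = dna.upper()
    for sign, strand in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            yield f"{sign}{offset + 1}", translate(strand[offset:])


def scan_est(dna, motif):
    """Return (frame, start, matched_text) for every motif hit in an EST."""
    pattern = re.compile(motif)
    return [
        (frame, m.start(), m.group())
        for frame, protein in six_frame(dna)
        for m in pattern.finditer(protein)
    ]


# Hypothetical motif and toy EST, for illustration only.
HYPOTHETICAL_MOTIF = r"C.{3}C.{2}C"
toy_est = "ATGTGTGCTGCTGCTTGTGCTGCTTGTTAA"
hits = scan_est(toy_est, HYPOTHETICAL_MOTIF)
print(hits)
```

Expressing the motif as a regular expression keeps the template simple, in the spirit of the motif-based screening described in the abstract; a real pipeline would also need to handle ambiguous bases and sequencing errors in EST data.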
The proposed method may be used to retrieve structures of interest from EST databases using simple amino acid sequence motifs as templates. The efficiency of the procedure for directed search of polypeptides is higher than that of most currently used methods. Analysis of 39939 ESTs of the sea anemone Anemonia viridis resulted in the identification of five protein precursors of previously described toxins, the discovery of 43 novel polypeptide toxins, and the prediction of 39 putative polypeptide toxin sequences. In addition, two precursors of novel peptides that presumably have a neuronal function were identified.