Department of Plant Pathology, University of Minnesota, St. Paul, Minnesota 55108, USA.
BMC Bioinformatics. 2013 Nov 20;14:335. doi: 10.1186/1471-2105-14-335.
Small peptides encoded as one- or two-exon genes in plants have recently been shown to affect multiple aspects of plant development, reproduction and defense responses. However, popular similarity search tools and gene prediction techniques generally fail to identify most members of this class of genes. This is largely due to the high sequence divergence among family members and the limited availability of experimentally verified small peptides to use as training sets for homology search and ab initio prediction. Consequently, both experimental and computational studies are urgently needed to advance the accurate prediction of small peptides.
We present SPADA, a homology-based gene prediction program that accurately predicts small peptides at the genome level. Given a high-quality profile alignment, SPADA identifies and annotates nearly all family members in the tested genomes and outperforms all general-purpose gene prediction programs surveyed. Using SPADA, we find numerous mis-annotations in the current Arabidopsis thaliana and Medicago truncatula genome databases; most of the corrected models have RNA-Seq expression support. We also show that SPADA works well on other classes of small secreted peptides in plants (e.g., self-incompatibility protein homologues) as well as on non-secreted peptides outside the plant kingdom (e.g., the alpha-amanitin toxin gene family in the mushroom Amanita bisporigera).
SPADA is a free software tool that accurately identifies and predicts the gene structure for short peptides with one or two exons. SPADA is able to incorporate information from profile alignments into the model prediction process and makes use of it to score different candidate models. SPADA achieves high sensitivity and specificity in predicting small plant peptides such as the cysteine-rich peptide families. A systematic application of SPADA to other classes of small peptides by research communities will greatly improve the genome annotation of different protein families in public genome databases.
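The idea of using a profile to score competing candidate gene models can be illustrated with a minimal sketch. This is not SPADA's implementation: the toy profile (a list of per-column residue scores), the ungapped scoring rule, the mismatch and length penalties, and all function names below are assumptions made purely for illustration. Each candidate model is a list of exon intervals (0-based, half-open); the spliced sequence is translated, incomplete reading frames are discarded, and the remaining models are ranked by how well their peptides fit the profile.

```python
from itertools import product

# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def splice(dna, exons):
    """Concatenate exon intervals (0-based, half-open) into a coding sequence."""
    return "".join(dna[start:end] for start, end in exons)


def translate(cds):
    """Return the peptide if cds is a complete ORF (ATG ... stop), else None."""
    if len(cds) < 6 or len(cds) % 3 != 0 or not cds.startswith("ATG"):
        return None
    aas = [CODON_TABLE.get(cds[i:i + 3]) for i in range(0, len(cds), 3)]
    if None in aas or aas[-1] != "*" or "*" in aas[:-1]:
        return None
    return "".join(aas[:-1])


def profile_score(peptide, profile, mismatch=-1.0, length_penalty=1.0):
    """Hypothetical ungapped score: per-column residue scores minus a length penalty."""
    matched = sum(col.get(aa, mismatch) for aa, col in zip(peptide, profile))
    return matched - length_penalty * abs(len(peptide) - len(profile))


def best_model(dna, candidates, profile):
    """Score every candidate with a complete ORF and return the top (score, exons, peptide)."""
    scored = []
    for exons in candidates:
        peptide = translate(splice(dna, exons))
        if peptide is not None:
            scored.append((profile_score(peptide, profile), exons, peptide))
    return max(scored, key=lambda item: item[0]) if scored else None


# Toy locus: two flanking bases, exon 1, a GT...AG intron, exon 2, two flanking bases.
dna = "CC" + "ATGTGC" + "GTAAGTTTTCAG" + "AAATGCTAA" + "GG"

# Toy four-column profile for a cysteine-rich peptide (scores are made up).
profile = [{"M": 2.0}, {"C": 3.0}, {"K": 1.0, "R": 1.0}, {"C": 3.0}]

candidates = [
    [(2, 29)],            # one exon reading through the intron
    [(2, 8), (20, 29)],   # two exons with the intron removed
    [(2, 8), (21, 29)],   # shifted acceptor site, breaks the reading frame
]

result = best_model(dna, candidates, profile)
print(result)
```

In this toy locus the single-exon read-through is also a complete open reading frame, so ORF validity alone cannot choose between the models; the profile score is what separates them. A real tool would use a proper profile model with gaps and calibrated scores rather than this fixed-length lookup.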