Lease Kevin A, Walker John C
Division of Biological Sciences, University of Missouri, Columbia, Missouri 65211, USA.
Plant Physiol. 2006 Nov;142(3):831-8. doi: 10.1104/pp.106.086041. Epub 2006 Sep 22.
In the era of genomics, if a gene is not annotated, it is not investigated. Due to their small size, genes encoding peptides are often missed in genome annotations. Secreted peptides are important regulators of plant growth, development, and physiology. Identification of additional peptide signals by sequence homology searches has had limited success due to sequence heterogeneity. A bioinformatics approach was taken to find unannotated Arabidopsis (Arabidopsis thaliana) peptides. Arabidopsis chromosome sequences were searched for all open reading frames (ORFs) encoding peptides and small proteins between 25 and 250 amino acids in length. The translated ORFs were then sequentially queried for the presence of an amino-terminal cleavable signal peptide, the absence of transmembrane domains, and the absence of endoplasmic reticulum lumenal retention sequences. Next, the ORFs were filtered against The Arabidopsis Information Resource (TAIR) 6.0 annotated Arabidopsis genes to remove those ORFs overlapping known genes. The remaining 33,809 ORFs were placed in a relational database in which additional annotation data were deposited. Genome-wide tiling array data were compared with the coordinates of the ORFs, suggesting that many of the ORFs may be expressed. In addition, clustering and sequence similarity analyses revealed that many of the putative peptides are in gene families and/or appear to be present in the rice (Oryza sativa) genome. A subset of the ORFs was evaluated by reverse transcription-PCR and, for one-fifth of those, expression was detected. These results support the idea that the number and diversity of plant peptides are greater than currently assumed. The peptides identified and their annotation data may be viewed or downloaded through a searchable Web interface at peptidome.missouri.edu.
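The ORF search and the overlap filter described above can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the abstract does not say how an ORF was delimited, so the sketch assumes an ORF runs from the first ATG after a stop codon to the next in-frame stop, scanned in all six reading frames. The function names (`find_orfs`, `remove_annotated`), the 0-based half-open coordinates, and the representation of annotated genes as `(start, end)` tuples are all hypothetical choices made for illustration. The signal peptide, transmembrane domain, and ER retention filters are not covered because they depend on external prediction tools that the abstract does not name.

```python
# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_orfs(seq, min_aa=25, max_aa=250):
    """Yield (start, end, strand, peptide) for each ATG-to-stop ORF.

    Assumption: an ORF starts at the first ATG after a stop codon.
    Coordinates are 0-based, half-open, on the forward strand, and
    include the stop codon. Peptide length excludes the stop.
    """
    seq = seq.upper()
    n = len(seq)
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None      # position of the current ORF's ATG, if any
            peptide = []
            for i in range(frame, n - 2, 3):
                aa = CODON_TABLE.get(s[i:i + 3], "X")  # X = ambiguous codon
                if start is None:
                    if aa == "M":
                        start, peptide = i, ["M"]
                elif aa == "*":
                    if min_aa <= len(peptide) <= max_aa:
                        end = i + 3
                        if strand == "+":
                            yield (start, end, strand, "".join(peptide))
                        else:
                            # Map reverse-strand positions back to forward.
                            yield (n - end, n - start, strand, "".join(peptide))
                    start = None
                else:
                    peptide.append(aa)


def remove_annotated(orfs, genes):
    """Drop ORFs whose interval overlaps any (start, end) gene interval."""
    return [
        o for o in orfs
        if not any(o[0] < g_end and g_start < o[1] for g_start, g_end in genes)
    ]
```

For a chromosome-scale input, the linear scan over gene intervals in `remove_annotated` would be slow; sorting the intervals or using an interval tree would be the natural replacement, but the simple form keeps the filtering logic easy to read.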