Molecular Biology and Biochemistry Department, Simon Fraser University, Burnaby, British Columbia, Canada.
PLoS One. 2013 Apr 25;8(4):e62204. doi: 10.1371/journal.pone.0062204. Print 2013.
BACKGROUND: The C. elegans genome has been extensively annotated by the WormBase consortium, which uses state-of-the-art bioinformatics pipelines, functional genomics and manual curation. As a result, identifying novel genes in silico in this model organism is becoming more challenging and requires new approaches. The oligonucleotide-oligosaccharide binding (OB) fold is a highly divergent protein family whose members share the same fold but very little sequence identity (5-25%). Evidence from sequence-based annotation may therefore be insufficient to identify all members of this family. In C. elegans, the number of OB-fold proteins reported is remarkably low (n=46) compared to other evolutionarily related eukaryotes, such as the yeast S. cerevisiae (n=344) or the fruit fly D. melanogaster (n=84). Gene loss during evolution, or differences in the level of annotation for this protein family, may explain these discrepancies.
METHODOLOGY/PRINCIPAL FINDINGS: This study examines the possibility that novel OB-fold coding genes exist in the worm. We developed a bioinformatics approach that uses the most sensitive sequence-sequence, sequence-profile and profile-profile similarity search methods, followed by 3D-structure prediction as a filtering step to eliminate false-positive candidate sequences. We predicted 18 coding genes containing the OB-fold that, remarkably, have been partially characterized in C. elegans.
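The search-then-filter logic described above might be sketched as follows. This is only an illustration, not the authors' pipeline: the `Candidate` fields, the gene names, the e-value cutoff and the fold-confidence cutoff are all hypothetical, and the sketch assumes that a candidate needs support from at least one of the three search tiers before the structure-prediction filter is applied.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical cutoffs; the abstract does not state the thresholds used.
EVALUE_CUTOFF = 1e-3
FOLD_CUTOFF = 0.9


@dataclass
class Candidate:
    gene: str                                # placeholder gene identifier
    seq_seq_evalue: Optional[float]          # sequence-sequence search, None if no hit
    seq_profile_evalue: Optional[float]      # sequence-profile search, None if no hit
    profile_profile_evalue: Optional[float]  # profile-profile search, None if no hit
    fold_confidence: float                   # 0..1 score from 3D-structure prediction
    already_annotated: bool                  # True if already listed as an OB-fold protein


def has_search_support(c: Candidate) -> bool:
    """True if any of the three similarity searches gives a significant hit."""
    evalues = (c.seq_seq_evalue, c.seq_profile_evalue, c.profile_profile_evalue)
    return any(e is not None and e <= EVALUE_CUTOFF for e in evalues)


def predict_novel(candidates: List[Candidate]) -> List[str]:
    """Keep unannotated candidates with search support that pass the structure filter."""
    return [
        c.gene
        for c in candidates
        if not c.already_annotated
        and has_search_support(c)
        and c.fold_confidence >= FOLD_CUTOFF
    ]


# Made-up example data, for illustration only.
candidates = [
    Candidate("geneA", 0.5, 1e-5, None, 0.95, False),     # kept
    Candidate("geneB", None, None, 1e-4, 0.40, False),    # fails the structure filter
    Candidate("geneC", 1e-10, 1e-12, 1e-20, 0.99, True),  # already annotated
    Candidate("geneD", 2.0, 0.8, None, 0.97, False),      # no significant search hit
]
novel = predict_novel(candidates)
print(novel)
```

In this sketch the structure prediction acts purely as a false-positive filter: it never rescues a candidate that has no search support, which mirrors the order of steps given in the abstract.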
CONCLUSIONS/SIGNIFICANCE: This study raises the possibility that the annotation of highly divergent protein fold families can be improved in C. elegans. The WormBase consortium could implement similar strategies for large-scale analysis when new versions of the genome sequence of C. elegans, or of other evolutionarily related species, are released. This approach is of general interest to the scientific community since it can be used to annotate any genome.