Oates Matt E, Stahlhacke Jonathan, Vavoulis Dimitrios V, Smithers Ben, Rackham Owen J L, Sardar Adam J, Zaucha Jan, Thurlby Natalie, Fang Hai, Gough Julian
Computer Science, University of Bristol, Bristol, BS8 1UB, UK
Nucleic Acids Res. 2015 Jan;43(Database issue):D227-33. doi: 10.1093/nar/gku1041. Epub 2014 Nov 20.
We present updates to the SUPERFAMILY 1.75 (http://supfam.org) online resource and protein sequence collection. The hidden Markov model library that provides sequence homology to SCOP structural domains remains unchanged at version 1.75. In the last 4 years SUPERFAMILY has more than doubled its holding of curated complete proteomes across all cellular life, from the 1400 proteomes reported in 2010 to 3258 at present. Outside of the main sequence collection, SUPERFAMILY continues to provide domain annotation for sequences provided by other resources such as UniProt, Ensembl, PDB, much of JGI Phytozome and selected subcollections of NCBI RefSeq. Despite this growth in data volume, SUPERFAMILY now provides users with an expanded and daily updated phylogenetic tree of life (sTOL). This tree is built with genomic-scale domain annotation data as before, but is now updated whenever new species are introduced to the sequence library. Our previously reported Gene Ontology and other functional and phenotypic annotations have stood up to critical assessment by the function prediction community. We have now made these data available online in an integrated manner at the level of an individual sequence and, in the case of whole genomes, with enrichment analysis against a taxonomically defined background.