Mulder Nicola Jane, Kersey Paul, Pruess Manuela, Apweiler Rolf
EMBL Outstation - European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, UK.
Mol Biotechnol. 2008 Feb;38(2):165-77. doi: 10.1007/s12033-007-9003-x. Epub 2007 Oct 4.
Nucleic acid sequences from genome sequencing projects are submitted as raw data, from which biologists attempt to elucidate the function of the predicted gene products. The protein sequences are stored in public databases, such as the UniProt Knowledgebase (UniProtKB), where curators try to add predicted and experimental functional information. Protein function can be predicted using sequence similarity searches, but an alternative approach is to use protein signatures, which classify proteins into families and domains. The major protein signature databases are available through the integrated InterPro database, which provides a classification of UniProtKB sequences. In addition to characterizing proteins through protein families, many researchers are interested in analyzing the complete set of proteins from a genome (i.e., the proteome), and there are databases and resources that provide non-redundant proteome sets and analyses of proteins from organisms with completely sequenced genomes. This article reviews the tools and resources available on the web for single and large-scale protein characterization and whole proteome analysis.