Babnigg György, Giometti Carol S
Protein Mapping Group, Biosciences Division, Argonne National Laboratory, IL 60439, USA.
Proteomics. 2006 Aug;6(16):4514-22. doi: 10.1002/pmic.200600032.
In proteome studies, identification of proteins requires searching protein sequence databases. The public protein sequence databases (e.g., NCBInr, UniProt) each contain millions of entries, and private databases add thousands more. Although much of the sequence information in these databases is redundant, each database uses distinct identifiers for the identical protein sequence and often contains unique annotation information. Users of one database obtain a database-specific sequence identifier that is often difficult to reconcile with the identifiers from a different database. When multiple databases are used for searches or the databases being searched are updated frequently, interpreting the protein identifications and associated annotations can be problematic. We have developed a database of unique protein sequence identifiers called Sequence Globally Unique Identifiers (SEGUID) derived from primary protein sequences. These identifiers serve as a common link between multiple sequence databases and are resilient to annotation changes in either public or private databases throughout the lifetime of a given protein sequence. The SEGUID Database can be downloaded (http://bioinformatics.anl.gov/SEGUID/) or easily generated at any site with access to primary protein sequence databases. Since SEGUIDs are stable, predictions based on the primary sequence information (e.g., pI, Mr) can be calculated just once; we have generated approximately 500 different calculations for more than 2.5 million sequences. SEGUIDs are used to integrate MS and 2-DE data with bioinformatics information and provide the opportunity to search multiple protein sequence databases, thereby providing a higher probability of finding the most valid protein identifications.
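The idea of an identifier derived from the primary sequence, used as a common key across databases, can be illustrated with a minimal sketch. The abstract does not state how the identifier is computed; the sketch assumes a SHA-1 digest of the uppercased sequence, Base64-encoded with the trailing padding removed, which is the scheme commonly associated with the SEGUID name. The function `link_databases`, the database names, the record identifiers, and the toy sequences are all hypothetical and exist only for illustration.

```python
import base64
import hashlib
from collections import defaultdict


def seguid(sequence: str) -> str:
    """Return a sequence-derived identifier.

    Assumed scheme (not stated in the abstract): SHA-1 of the
    whitespace-free, uppercased sequence, Base64-encoded, with the
    trailing '=' padding stripped.
    """
    normalized = "".join(sequence.split()).upper()
    digest = hashlib.sha1(normalized.encode("ascii")).digest()
    return base64.b64encode(digest).decode("ascii").rstrip("=")


def link_databases(databases):
    """Group (database, identifier) pairs by their sequence-derived key.

    `databases` maps a database name to a dict of identifier -> sequence.
    Records with identical sequences end up under the same key, whatever
    identifier each database assigned.
    """
    links = defaultdict(list)
    for db_name, records in databases.items():
        for identifier, sequence in records.items():
            links[seguid(sequence)].append((db_name, identifier))
    return dict(links)


# Hypothetical records: made-up identifiers and short toy sequences.
example = {
    "db_a": {"A0001": "MKTAYIAKQR", "A0002": "MSLLTEVETP"},
    "db_b": {"B9001": "mktayiakqr"},
}
links = link_databases(example)

for key, members in links.items():
    print(key, members)
```

Because the key depends only on the sequence, renaming or re-annotating a record in either database leaves the key unchanged, and any sequence-based prediction (e.g., pI, Mr) stored under that key would not need to be recomputed.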