Bachinsky A G, Frolov A S, Naumochkin A N, Nizolenko L P, Yarigin A A
Theoretical Department, Research Institute of Molecular Biology, SRC VB 'Vector', Koltsovo, Novosibirsk region, 630559, Russia.
Bioinformatics. 2000 Apr;16(4):358-66. doi: 10.1093/bioinformatics/16.4.358.
When analysing novel protein sequences, it is now essential to extend search strategies to include a range of 'secondary' databases. Pattern databases have become vital tools for identifying distant relationships in sequences, and hence for predicting protein function and structure. The main drawback of such databases is that the protein samples used to build them were relatively small at the time of their construction. A negative result from comparing an amino acid sequence with such a databank therefore forces a researcher to search for similarities in the original protein banks. We developed a database of patterns constructed for groups of related proteins, with maximum representation of SWISS-PROT amino acid sequences in the groups.
Software tools and a new method have been designed to construct patterns of protein families. Using this method, a new version of the databank of protein family patterns, PROF_PAT 1.3, has been produced. The bank is based on SWISS-PROT (r1.38) and TrEMBL (r1.11), and contains patterns for more than 13 000 groups of related proteins in a format similar to that of PROSITE. The pattern motifs selected were those with the lowest probability of being found in random sequences. A flexible, fast search program accompanies the bank. The researcher can specify a similarity matrix (PAM, BLOSUM or another type). Variable levels of similarity can be set, permitting search strategies that range from exact matches to increasing levels of 'fuzziness'.
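To make the idea of a PROSITE-like pattern concrete, the following Python sketch translates a small subset of the standard PROSITE pattern syntax into a regular expression and estimates how likely a fixed-length pattern is to match at one position of a random sequence. It is an illustration only, not the PROF_PAT software: the function names are hypothetical, the random model assumes uniform residue frequencies (1/20 each), and only exact matching is shown, so the similarity-matrix 'fuzzy' search described above is not reproduced.

```python
import re

# One pattern element: x, a residue, [allowed], or {forbidden},
# optionally followed by a repeat count (n) or range (n,m).
_ELEMENT = re.compile(r"(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?")


def _parse(pattern):
    """Yield (core, lo, hi) for each '-'-separated element of the pattern."""
    for element in pattern.rstrip(".").split("-"):
        m = _ELEMENT.fullmatch(element)
        if m is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        yield m.groups()


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern (syntax subset) into a regex string."""
    parts = []
    for core, lo, hi in _parse(pattern):
        if core == "x":
            core = "."                       # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"   # any residue except these
        if lo is not None:
            core += "{" + lo + ("," + hi if hi else "") + "}"
        parts.append(core)
    return "".join(parts)


def find_matches(pattern, sequence):
    """Return (start, matched_text) for non-overlapping exact matches."""
    regex = re.compile(prosite_to_regex(pattern))
    return [(m.start(), m.group()) for m in regex.finditer(sequence)]


def random_match_probability(pattern):
    """Probability that a fixed-length pattern matches at one position of a
    random sequence, ASSUMING uniform residue frequencies of 1/20."""
    prob = 1.0
    for core, lo, hi in _parse(pattern):
        if hi is not None:
            raise ValueError("variable-length elements are not handled here")
        repeat = int(lo) if lo else 1
        if core == "x":
            allowed = 20
        elif core.startswith("["):
            allowed = len(set(core[1:-1]))
        elif core.startswith("{"):
            allowed = 20 - len(set(core[1:-1]))
        else:
            allowed = 1
        prob *= (allowed / 20) ** repeat
    return prob
```

For example, `find_matches("C-x(2)-C-x(3)-H", "MKCAACQWEHLL")` locates the single occurrence starting at index 2, and `random_match_probability("C-x(2)-C")` evaluates to about 0.0025 under the uniform assumption. A real selection criterion would more plausibly use observed residue frequencies rather than a uniform background.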
The Internet address for comparing sequences with the bank is: http://wwwmgs.bionet.nsc.ru/mgs/programs/prof_pat/. The local version of the bank and search programs (approximately 50 Mb) is available via ftp from ftp://ftp.bionet.nsc.ru/pub/biology/vector/prof_pat/ and ftp://ftp.ebi.ac.uk/pub/databases/prof_pat/. Alternatively, amino acid sequences can be e-mailed to bachin@vector.nsc.ru for comparison with PROF_PAT 1.3.