Bleasby A J, Wootton J C
Departments of Genetics and Biophysics, University of Leeds, UK.
Protein Eng. 1990 Jan;3(3):153-9. doi: 10.1093/protein/3.3.153.
A strategy has been developed for the construction of a validated, comprehensive composite protein sequence database. Entries are amalgamated from primary source databases by a largely automated set of processes in which redundant and trivially different entries are eliminated. A modular approach has been adopted to allow scientific judgement to be used at each stage of database processing and amalgamation. Source databases are assigned a priority depending on the quality of sequence validation and commenting. In each pairwise comparison of databases, entries from the lower-priority database are rejected according to optionally defined redundancy criteria based on sequence segment mismatches. Efficient algorithms for this methodology are embodied in the COMPO software system. COMPO has been used for over 2 years to construct and regularly update the OWL composite protein sequence database from the source databases NBRF-PIR, SWISS-PROT, a GenBank translation retrieved from the feature tables, NBRF-NEW, NEWAT86, PSD-KYOTO and the sequences contained in the Brookhaven protein structure databank. OWL is part of the ISIS integrated data resource of protein sequence and structure [Akrigg et al. (1988) Nature, 335, 745-746]. The modular nature of the integration process greatly facilitates the frequent updating of OWL following releases of the source databases. The comparison process also reveals the extent of redundancy in these sources. The advantages of a robust composite database for sequence similarity searching and information retrieval are discussed.
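The priority-based amalgamation described above can be illustrated with a minimal Python sketch. It is not the COMPO implementation: the names `Entry`, `is_redundant` and `build_composite` are hypothetical, and the redundancy criterion used here (equal length and at most `max_mismatches` differing positions) is an assumed stand-in for the segment-mismatch criteria, which the abstract does not specify. Each database is compared only against entries already accepted from higher-priority databases.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    source: str      # name of the source database (illustrative)
    accession: str   # identifier within that source
    sequence: str    # amino-acid sequence in one-letter code


def is_redundant(a: str, b: str, max_mismatches: int = 0) -> bool:
    """Assumed criterion: same length and at most max_mismatches differences."""
    if len(a) != len(b):
        return False
    mismatches = 0
    for x, y in zip(a, b):
        if x != y:
            mismatches += 1
            if mismatches > max_mismatches:
                return False
    return True


def build_composite(databases, max_mismatches=0):
    """Merge databases listed in descending priority order.

    An entry is rejected when it is redundant with an entry already
    accepted from a higher-priority database.
    """
    composite = []
    by_length = {}  # sequence length -> accepted entries, to limit comparisons
    for entries in databases:
        accepted_here = []
        for entry in entries:
            candidates = by_length.get(len(entry.sequence), [])
            if any(
                is_redundant(entry.sequence, kept.sequence, max_mismatches)
                for kept in candidates
            ):
                continue  # drop the lower-priority duplicate
            accepted_here.append(entry)
        # Index this database's entries only after it is fully processed,
        # so comparisons stay between databases, not within one.
        for entry in accepted_here:
            by_length.setdefault(len(entry.sequence), []).append(entry)
        composite.extend(accepted_here)
    return composite


high = [Entry("A", "a1", "MKTAYIAK"), Entry("A", "a2", "GSHMLE")]
low = [
    Entry("B", "b1", "MKTAYIAK"),  # identical to a1
    Entry("B", "b2", "MKTAYIAR"),  # one mismatch against a1
    Entry("B", "b3", "PPPPP"),     # unique
]
strict = build_composite([high, low], max_mismatches=0)
relaxed = build_composite([high, low], max_mismatches=1)
```

With `max_mismatches=0` only the exact duplicate `b1` is rejected. Raising the threshold to 1 also rejects the trivially different `b2`, which mirrors the idea of optionally defined redundancy criteria.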