Selection of representative protein data sets.

Authors

Hobohm U, Scharf M, Schneider R, Sander C

Affiliation

European Molecular Biology Laboratory, Heidelberg, Germany.

Publication

Protein Sci. 1992 Mar;1(3):409-17. doi: 10.1002/pro.5560010313.

Abstract

The Protein Data Bank currently contains about 600 data sets of three-dimensional protein coordinates determined by X-ray crystallography or NMR. There is considerable redundancy in the data base, as many protein pairs are identical or very similar in sequence. However, statistical analyses of protein sequence-structure relations require nonredundant data. We have developed two algorithms to extract from the data base representative sets of protein chains with maximum coverage and minimum redundancy. The first algorithm focuses on optimizing a particular property of the selected proteins and works by successive selection of proteins from an ordered list and exclusion of all neighbors of each selected protein. The other algorithm aims at maximizing the size of the selected set and works by successive thinning out of clusters of similar proteins. Both algorithms are generally applicable to other data bases in which criteria of similarity can be defined and relate to problems in graph theory. The largest nonredundant set extracted from the current release of the Protein Data Bank has 155 protein chains. In this set, no two proteins have sequence similarity higher than a certain cutoff (30% identical residues for aligned subsequences longer than 80 residues), yet all structurally unique protein families are represented. Periodically updated lists of representative data sets are available by electronic mail from the file server "netserv@embl-heidelberg.de." The selection may be useful in statistical approaches to protein folding as well as in the analysis and documentation of the known spectrum of three-dimensional protein structures.
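The two selection strategies described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, and everything beyond the abstract is an assumption. The function names (`build_neighbors`, `select_from_ordered_list`, `thin_out_clusters`) and the toy chain labels are hypothetical. Pairs with an aligned length of 80 residues or fewer are simply treated as not similar, because the abstract states the 30% cutoff only for longer alignments. "Thinning out" is interpreted here as repeatedly removing the chain that has the most remaining neighbors.

```python
from typing import Dict, Iterable, List, Set, Tuple

Neighbors = Dict[str, Set[str]]
# (chain_a, chain_b, percent identical residues, aligned length) -- toy format
Alignment = Tuple[str, str, float, int]


def build_neighbors(chains: Iterable[str], alignments: Iterable[Alignment],
                    min_identity: float = 30.0, min_length: int = 80) -> Neighbors:
    """Build a similarity graph: two chains are neighbors if their alignment
    is longer than min_length and more than min_identity percent identical.
    Assumption: shorter alignments are ignored in this sketch."""
    neighbors: Neighbors = {chain: set() for chain in chains}
    for a, b, identity, length in alignments:
        if length > min_length and identity > min_identity:
            neighbors[a].add(b)
            neighbors[b].add(a)
    return neighbors


def select_from_ordered_list(ordered: List[str], neighbors: Neighbors) -> List[str]:
    """First strategy: walk a list ordered by the property to optimize,
    keep each chain not yet excluded, and exclude all of its neighbors."""
    selected: List[str] = []
    excluded: Set[str] = set()
    for chain in ordered:
        if chain in excluded:
            continue
        selected.append(chain)
        excluded.update(neighbors[chain])
    return selected


def thin_out_clusters(neighbors: Neighbors) -> List[str]:
    """Second strategy (assumed interpretation): repeatedly remove the chain
    with the most remaining neighbors until no similar pair is left."""
    remaining: Neighbors = {c: set(n) for c, n in neighbors.items()}
    while remaining:
        # Ties are broken by chain label so the result is deterministic.
        worst = max(remaining, key=lambda c: (len(remaining[c]), c))
        if not remaining[worst]:
            break  # every surviving chain has no neighbors left
        for other in remaining.pop(worst):
            remaining[other].discard(worst)
    return sorted(remaining)


# Toy data: chain H is similar to A, B and C; D is unrelated to all others.
chains = ["H", "A", "B", "C", "D"]  # assumed ordered by quality, best first
alignments = [
    ("H", "A", 45.0, 120),
    ("H", "B", 38.0, 150),
    ("H", "C", 52.0, 95),
    ("A", "D", 60.0, 40),   # high identity but too short to count here
    ("B", "C", 25.0, 200),  # long but below the identity cutoff
]
graph = build_neighbors(chains, alignments)
print(select_from_ordered_list(chains, graph))  # ['H', 'D']
print(thin_out_clusters(graph))                 # ['A', 'B', 'C', 'D']
```

On this toy input the ordered selection keeps the top-ranked chain H and therefore discards A, B and C, whereas thinning removes only H and retains four chains. This mirrors the trade-off stated in the abstract between optimizing a property of the selected proteins and maximizing the size of the set.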


