Selection of representative protein data sets.

Authors

Hobohm U, Scharf M, Schneider R, Sander C

Affiliation

European Molecular Biology Laboratory, Heidelberg, Germany.

Publication details

Protein Sci. 1992 Mar;1(3):409-17. doi: 10.1002/pro.5560010313.

DOI: 10.1002/pro.5560010313
PMID: 1304348
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC2142204/

Abstract

The Protein Data Bank currently contains about 600 data sets of three-dimensional protein coordinates determined by X-ray crystallography or NMR. There is considerable redundancy in the data base, as many protein pairs are identical or very similar in sequence. However, statistical analyses of protein sequence-structure relations require nonredundant data. We have developed two algorithms to extract from the data base representative sets of protein chains with maximum coverage and minimum redundancy. The first algorithm focuses on optimizing a particular property of the selected proteins and works by successive selection of proteins from an ordered list and exclusion of all neighbors of each selected protein. The other algorithm aims at maximizing the size of the selected set and works by successive thinning out of clusters of similar proteins. Both algorithms are generally applicable to other data bases in which criteria of similarity can be defined and relate to problems in graph theory. The largest nonredundant set extracted from the current release of the Protein Data Bank has 155 protein chains. In this set, no two proteins have sequence similarity higher than a certain cutoff (30% identical residues for aligned subsequences longer than 80 residues), yet all structurally unique protein families are represented. Periodically updated lists of representative data sets are available by electronic mail from the file server "netserv@embl-heidelberg.de." The selection may be useful in statistical approaches to protein folding as well as in the analysis and documentation of the known spectrum of three-dimensional protein structures.
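The two selection strategies described in the abstract can be sketched on a similarity graph, where each protein chain is a node and an edge joins any two chains above the similarity cutoff. The following is a minimal illustration, not the authors' implementation. The function names, the toy chain labels, and the precomputed list of similar pairs are all assumptions made for the example. In particular, "thinning out" is read here as repeatedly removing the chain with the most remaining neighbors; the abstract does not spell out the exact removal rule.

```python
from typing import Dict, Hashable, Iterable, List, Set, Tuple

Neighbors = Dict[Hashable, Set[Hashable]]


def build_neighbors(items: Iterable[Hashable],
                    similar_pairs: Iterable[Tuple[Hashable, Hashable]]) -> Neighbors:
    """Undirected similarity graph: item -> set of items above the cutoff."""
    nbrs: Neighbors = {x: set() for x in items}
    for a, b in similar_pairs:
        nbrs[a].add(b)
        nbrs[b].add(a)
    return nbrs


def select_from_ordered(ordered: Iterable[Hashable], nbrs: Neighbors) -> List[Hashable]:
    """First strategy: walk a list sorted by some preferred property,
    keep each item not yet excluded, then exclude all of its neighbors."""
    excluded: Set[Hashable] = set()
    selected: List[Hashable] = []
    for x in ordered:
        if x in excluded:
            continue
        selected.append(x)
        excluded |= nbrs[x]
    return selected


def thin_out(nbrs: Neighbors) -> List[Hashable]:
    """Second strategy (assumed removal rule): repeatedly drop the item with
    the most remaining neighbors until no similar pair is left."""
    live = {x: set(n) for x, n in nbrs.items()}  # work on a copy
    while live:
        worst = max(live, key=lambda x: len(live[x]))
        if not live[worst]:
            break  # no similar pairs remain
        for n in live.pop(worst):
            live[n].discard(worst)
    return list(live)


# Toy example: one "hub" chain similar to three otherwise unrelated chains.
chains = ["hub", "a", "b", "c"]
pairs = [("hub", "a"), ("hub", "b"), ("hub", "c")]
graph = build_neighbors(chains, pairs)

by_order = select_from_ordered(chains, graph)  # "hub" ranked first
largest = thin_out(graph)

print(by_order)  # ['hub']
print(largest)   # ['a', 'b', 'c']
```

The toy graph shows why the two strategies differ. Selecting from the ordered list keeps the top-ranked chain even though that excludes three others, while thinning out sacrifices the highly connected chain to keep a larger set. In a real application, the pairs would come from sequence alignments tested against the stated cutoff (30% identical residues over aligned subsequences longer than 80 residues).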

