Exploring dynamics of protein structure determination and homology-based prediction to estimate the number of superfamilies and folds.

Authors

Sadreyev Ruslan I, Grishin Nick V

Affiliation

Howard Hughes Medical Institute, Department of Biochemistry, University of Texas Southwestern Medical Center, Dallas, TX 75390-8816, USA.

Publication

BMC Struct Biol. 2006 Mar 20;6:6. doi: 10.1186/1472-6807-6-6.

DOI: 10.1186/1472-6807-6-6
PMID: 16549009
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC1444916/
Abstract

BACKGROUND

As tertiary structure is currently available for only a fraction of known protein families, it is important to assess which parts of sequence space have been structurally characterized. We consider protein domains whose structure can be predicted by sequence similarity to proteins with solved structures and address the following questions. Do these domains represent an unbiased random sample of all sequence families? Do targets solved by structural genomics initiatives (SGI) provide such a sample? What are the approximate total numbers of structure-based superfamilies and folds among soluble globular domains?

RESULTS

To make these assessments, we combine two approaches: (i) sequence analysis and homology-based structure prediction for proteins from complete genomes; and (ii) monitoring the dynamics of the assigned structure set over time, as experimentally solved structures accumulate. In the Clusters of Orthologous Groups (COG) database, we map the growing population of structurally characterized domain families onto the network of sequence-based connections between domains. This mapping reveals a systematic bias: target families for structure determination tend to be located in highly populated areas of sequence space. In contrast, the subset of domains whose structure is initially inferred by SGI is similar to a random sample from the whole population. To account for the observed bias, we propose a new non-parametric approach to estimating the total numbers of structural superfamilies and folds, which does not rely on a specific model of the sampling process. Based on the dynamics of robust distribution-based parameters in the growing set of structure predictions, we estimate the total numbers of superfamilies and folds among soluble globular proteins in the COG database.
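
The kind of bias described above can be illustrated with a toy check. The abstract does not spell out the statistical procedure, so this is only a minimal sketch under stated assumptions: each domain family is summarized by a hypothetical count of sequence-based links (`degrees`), and a one-sided permutation test stands in for whatever comparison the authors actually used. The function name, the data, and the parameters are all invented for illustration.

```python
import random
from statistics import mean


def degree_bias_pvalue(degrees, characterized, n_permutations=2000, seed=0):
    """One-sided permutation test (illustrative, not the paper's method).

    Asks whether the structurally characterized families have a higher mean
    number of sequence-based links than equally sized random subsets drawn
    from all families.
    """
    rng = random.Random(seed)
    families = list(degrees)
    k = len(characterized)
    observed = mean(degrees[f] for f in characterized)

    hits = 0
    for _ in range(n_permutations):
        sample = rng.sample(families, k)  # random subset, without replacement
        if mean(degrees[f] for f in sample) >= observed:
            hits += 1

    # Add-one correction so the estimated p-value is never exactly zero.
    return observed, (hits + 1) / (n_permutations + 1)


# Toy data (invented): 10 densely linked families, 90 sparsely linked ones.
degrees = {f"fam{i}": (20 if i < 10 else 1) for i in range(100)}
characterized = [f"fam{i}" for i in range(10)]  # all from the dense region

observed, p_value = degree_bias_pvalue(degrees, characterized)
print(f"mean links in characterized set: {observed}, p = {p_value:.4f}")
```

On this toy input the characterized set sits entirely in the densely connected region, so the p-value is small. A set resembling a random draw from all families, as the abstract reports for SGI-derived domains, would typically give a much larger p-value.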

CONCLUSION

The set of currently solved protein structures allows structure prediction for approximately a third of sequence-based domain families. The choice of targets for structure determination is biased towards domains with many sequence-based homologs. Growing SGI output should further reduce this bias in the future. The total numbers of structural superfamilies and folds in the COG database are estimated at approximately 4000 and approximately 1700, respectively. These numbers are, respectively, four and three times higher than the numbers of superfamilies and folds that can currently be assigned to COG proteins.
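
As a back-of-envelope check, the ratios quoted in the conclusion imply rough counts for what can currently be assigned. The sketch below assumes that "four and three times higher" means a plain ratio of 4 and 3. The resulting figures follow from that reading and are not stated in the abstract.

```python
# Estimated totals quoted in the abstract.
estimated_total = {"superfamilies": 4000, "folds": 1700}

# Assumed interpretation: total = factor * currently assignable count.
factor = {"superfamilies": 4, "folds": 3}

implied_current = {
    name: round(estimated_total[name] / factor[name]) for name in estimated_total
}
print(implied_current)
```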


Figures:
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/a0b3/1444916/bcd18be0df68/1472-6807-6-6-1.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/a0b3/1444916/532e15821990/1472-6807-6-6-2.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/a0b3/1444916/fc90e23ebe21/1472-6807-6-6-3.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/a0b3/1444916/ab3536b461eb/1472-6807-6-6-4.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/a0b3/1444916/c3314fc0ed72/1472-6807-6-6-5.jpg
