

Assignment of protein sequences to existing domain and family classification systems: Pfam and the PDB.

Affiliation

Institute for Cancer Research, Fox Chase Cancer Center, 333 Cottman Avenue, Philadelphia, PA 19111, USA.

Publication information

Bioinformatics. 2012 Nov 1;28(21):2763-72. doi: 10.1093/bioinformatics/bts533. Epub 2012 Aug 31.

DOI:10.1093/bioinformatics/bts533
PMID:22942020
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC3476341/
Abstract

MOTIVATION

Automating the assignment of existing domain and protein family classifications to new sets of sequences is an important task. Current methods often miss assignments because remote relationships fail to achieve statistical significance. Some assignments are not as long as the actual domain definitions because local alignment methods often cut alignments short. Long insertions in query sequences often erroneously result in two copies of the domain assigned to the query. Divergent repeat sequences in proteins are often missed.

RESULTS

We have developed a multilevel procedure to produce nearly complete assignments of protein families of an existing classification system to a large set of sequences. We apply this to the task of assigning Pfam domains to sequences and structures in the Protein Data Bank (PDB). We found that HHsearch alignments frequently scored more remotely related Pfams in Pfam clans higher than closely related Pfams, thus, leading to erroneous assignment at the Pfam family level. A greedy algorithm allowing for partial overlaps was, thus, applied first to sequence/HMM alignments, then HMM-HMM alignments and then structure alignments, taking care to join partial alignments split by large insertions into single-domain assignments. Additional assignment of repeat Pfams with weaker E-values was allowed after stronger assignments of the repeat HMM. Our database of assignments, presented in a database called PDBfam, contains Pfams for 99.4% of chains >50 residues.
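
The tiered greedy step described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `Hit` record, the function names, and the thresholds `max_overlap` and `hmm_slack` are all hypothetical, and the abstract gives no concrete cutoff values. The sketch joins same-family hits that cover successive parts of the HMM (as happens when a long insertion splits one domain), then accepts hits tier by tier in order of E-value while tolerating small overlaps with earlier assignments. The special rule for weaker repeat hits is not modeled.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class Hit:
    """One alignment of a query region to a family HMM (hypothetical record)."""
    family: str
    q_start: int    # 1-based, inclusive query coordinates
    q_end: int
    hmm_start: int  # 1-based, inclusive HMM coordinates
    hmm_end: int
    evalue: float


def overlap(a: Hit, b: Hit) -> int:
    """Number of query residues shared by two hits."""
    return max(0, min(a.q_end, b.q_end) - max(a.q_start, b.q_start) + 1)


def join_split_hits(hits: Sequence[Hit], hmm_slack: int = 5) -> List[Hit]:
    """Merge same-family hits that continue along the HMM.

    Two hits are treated as one domain split by an insertion when the second
    starts after the first on the query and resumes the HMM roughly where the
    first stopped. Hits that restart the HMM are kept as separate copies.
    hmm_slack is an assumed tolerance, not a value from the paper.
    """
    joined: List[Hit] = []
    for hit in sorted(hits, key=lambda h: (h.family, h.q_start)):
        prev = joined[-1] if joined else None
        if (prev is not None
                and prev.family == hit.family
                and hit.q_start > prev.q_end
                and hit.hmm_start >= prev.hmm_end - hmm_slack):
            joined[-1] = Hit(prev.family, prev.q_start, hit.q_end,
                             prev.hmm_start, max(prev.hmm_end, hit.hmm_end),
                             min(prev.evalue, hit.evalue))
        else:
            joined.append(hit)
    return joined


def greedy_assign(tiers: Sequence[Sequence[Hit]], max_overlap: int = 10) -> List[Hit]:
    """Accept hits greedily, tier by tier, best E-value first.

    tiers is ordered from most to least trusted evidence, for example
    sequence/HMM hits, then HMM-HMM hits, then structure-based hits.
    A hit is accepted only if it overlaps every earlier assignment by at
    most max_overlap residues (an assumed threshold).
    """
    accepted: List[Hit] = []
    for tier in tiers:
        for hit in sorted(join_split_hits(tier), key=lambda h: h.evalue):
            if all(overlap(hit, a) <= max_overlap for a in accepted):
                accepted.append(hit)
    return sorted(accepted, key=lambda h: h.q_start)
```

Processing the tiers in order means a weaker HMM-HMM or structure-based hit can only fill regions left uncovered by stronger sequence/HMM assignments, which matches the ordering stated in the abstract.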

AVAILABILITY

The Pfam assignment data in PDBfam are available at http://dunbrack2.fccc.edu/ProtCid/PDBfam, which can be searched by PDB codes and Pfam identifiers. They will be updated regularly.



