Assignment of protein sequences to existing domain and family classification systems: Pfam and the PDB.

Affiliation

Institute for Cancer Research, Fox Chase Cancer Center, 333 Cottman Avenue, Philadelphia, PA 19111, USA.

Publication information

Bioinformatics. 2012 Nov 1;28(21):2763-72. doi: 10.1093/bioinformatics/bts533. Epub 2012 Aug 31.


DOI:10.1093/bioinformatics/bts533
PMID:22942020
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC3476341/
Abstract

MOTIVATION: Automating the assignment of existing domain and protein family classifications to new sets of sequences is an important task. Current methods often miss assignments because remote relationships fail to achieve statistical significance. Some assignments are not as long as the actual domain definitions because local alignment methods often cut alignments short. Long insertions in query sequences often erroneously result in two copies of the domain assigned to the query. Divergent repeat sequences in proteins are often missed.

RESULTS: We have developed a multilevel procedure to produce nearly complete assignments of protein families of an existing classification system to a large set of sequences. We apply this to the task of assigning Pfam domains to sequences and structures in the Protein Data Bank (PDB). We found that HHsearch alignments frequently scored more remotely related Pfams in Pfam clans higher than closely related Pfams, thus, leading to erroneous assignment at the Pfam family level. A greedy algorithm allowing for partial overlaps was, thus, applied first to sequence/HMM alignments, then HMM-HMM alignments and then structure alignments, taking care to join partial alignments split by large insertions into single-domain assignments. Additional assignment of repeat Pfams with weaker E-values was allowed after stronger assignments of the repeat HMM. Our database of assignments, presented in a database called PDBfam, contains Pfams for 99.4% of chains >50 residues.

AVAILABILITY: The Pfam assignment data in PDBfam are available at http://dunbrack2.fccc.edu/ProtCid/PDBfam, which can be searched by PDB codes and Pfam identifiers. They will be updated regularly.
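The two ideas named in RESULTS, greedy assignment with a tolerance for partial overlaps and joining alignments split by a long insertion, can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Hit` fields, the function names, the thresholds `max_overlap=10` and `hmm_slack=5`, and the choice to join before selecting are all assumptions made for illustration. The abstract does not state the actual cutoffs or the details of the procedure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """One query-to-family alignment (field names are hypothetical)."""
    family: str      # family identifier, e.g. a Pfam accession
    start: int       # first aligned query residue, 1-based inclusive
    end: int         # last aligned query residue, inclusive
    hmm_start: int   # first aligned position in the family model
    hmm_end: int     # last aligned position in the family model
    evalue: float


def overlap(a: Hit, b: Hit) -> int:
    """Number of query residues covered by both hits."""
    return max(0, min(a.end, b.end) - max(a.start, b.start) + 1)


def join_split_hits(hits, hmm_slack=5):
    """Merge adjacent same-family hits whose model ranges continue one another.

    Such a pair is read as one domain interrupted by an insertion rather than
    two copies. Only hits adjacent in query order are considered (a
    simplification); hmm_slack is an assumed tolerance, not a published value.
    """
    merged = []
    for hit in sorted(hits, key=lambda h: h.start):
        prev = merged[-1] if merged else None
        if (prev is not None and prev.family == hit.family
                and hit.hmm_start >= prev.hmm_end - hmm_slack):
            merged[-1] = Hit(hit.family, prev.start, hit.end,
                             prev.hmm_start, max(prev.hmm_end, hit.hmm_end),
                             min(prev.evalue, hit.evalue))
        else:
            merged.append(hit)
    return merged


def greedy_assign(hits, max_overlap=10):
    """Accept hits from best to worst E-value.

    A hit is skipped if it overlaps any already accepted hit by more than
    max_overlap query residues (assumed threshold).
    """
    accepted = []
    for hit in sorted(hits, key=lambda h: h.evalue):
        if all(overlap(hit, kept) <= max_overlap for kept in accepted):
            accepted.append(hit)
    return sorted(accepted, key=lambda h: h.start)


def assign_domains(hits):
    """Join split alignments first, then select greedily (assumed order)."""
    return greedy_assign(join_split_hits(hits))
```

With hypothetical hits in which family `PF_A` aligns to residues 1-80 (model positions 1-60) and again to residues 140-200 (model positions 58-120), the two pieces are joined into a single 1-200 assignment. A second copy of the domain would instead restart near model position 1 and stay separate.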
