Institute for Cancer Research, Fox Chase Cancer Center, 333 Cottman Avenue, Philadelphia, PA 19111, USA.
Bioinformatics. 2012 Nov 1;28(21):2763-72. doi: 10.1093/bioinformatics/bts533. Epub 2012 Aug 31.
MOTIVATION: Automating the assignment of existing domain and protein family classifications to new sets of sequences is an important task. Current methods often miss assignments because remote relationships fail to achieve statistical significance. Some assignments are shorter than the actual domain definitions because local alignment methods often cut alignments short. Long insertions in query sequences often erroneously result in two copies of the domain being assigned to the query. Divergent repeat sequences in proteins are often missed.
RESULTS: We have developed a multilevel procedure to produce nearly complete assignments of protein families of an existing classification system to a large set of sequences. We apply this to the task of assigning Pfam domains to sequences and structures in the Protein Data Bank (PDB). We found that HHsearch alignments frequently scored more remotely related Pfams in Pfam clans higher than closely related Pfams, thus leading to erroneous assignment at the Pfam family level. A greedy algorithm allowing for partial overlaps was therefore applied first to sequence/HMM alignments, then to HMM-HMM alignments and then to structure alignments, taking care to join partial alignments split by large insertions into single-domain assignments. Additional assignment of repeat Pfams with weaker E-values was allowed after stronger assignments of the repeat HMM. Our assignments, presented in a database called PDBfam, contain Pfams for 99.4% of chains >50 residues.
AVAILABILITY: The Pfam assignment data in PDBfam are available at http://dunbrack2.fccc.edu/ProtCid/PDBfam, which can be searched by PDB codes and Pfam identifiers. They will be updated regularly.
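The greedy, overlap-tolerant assignment described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `Hit` record, the `greedy_assign` function, the overlap limit of 10 residues and the E-value cutoff of 1e-5 are all hypothetical choices made for illustration, and the joining of alignments split by large insertions and the relaxed handling of repeat Pfams are omitted. The sketch only shows the core idea: hits are considered from strongest to weakest E-value, a hit is accepted if it overlaps every previously accepted hit by no more than a few residues, and assignments from an earlier level (for example sequence/HMM alignments) are kept fixed when a later level (for example HMM-HMM alignments) is processed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """One candidate domain assignment on a query chain (illustrative record)."""
    family: str    # family identifier, e.g. a Pfam name (placeholder values below)
    start: int     # first query residue covered, 1-based inclusive
    end: int       # last query residue covered, inclusive
    evalue: float  # alignment E-value; smaller is stronger


def overlap(a: Hit, b: Hit) -> int:
    """Number of query residues shared by two hits."""
    return max(0, min(a.end, b.end) - max(a.start, b.start) + 1)


def greedy_assign(hits, accepted=(), max_overlap=10, max_evalue=1e-5):
    """Greedily accept hits in order of increasing E-value.

    `accepted` holds assignments from an earlier level; they are never removed.
    `max_overlap` and `max_evalue` are assumed thresholds, not published values.
    """
    result = list(accepted)
    for hit in sorted(hits, key=lambda h: h.evalue):
        if hit.evalue > max_evalue:
            break  # remaining hits are weaker still
        if all(overlap(hit, other) <= max_overlap for other in result):
            result.append(hit)
    return sorted(result, key=lambda h: h.start)


# Level 1: hypothetical sequence/HMM hits.
level1 = [
    Hit("FAM_A", 1, 100, 1e-30),
    Hit("FAM_B", 95, 200, 1e-20),   # overlaps FAM_A by 6 residues: tolerated
    Hit("FAM_C", 50, 150, 1e-10),   # overlaps FAM_A by 51 residues: rejected
    Hit("FAM_D", 300, 350, 1e-2),   # fails the E-value cutoff
]
# Level 2: hypothetical HMM-HMM hits, applied only to what level 1 left open.
level2 = [
    Hit("FAM_E", 210, 290, 1e-8),
    Hit("FAM_F", 1, 90, 1e-9),      # region already taken by FAM_A: rejected
]

after_level1 = greedy_assign(level1)
after_level2 = greedy_assign(level2, accepted=after_level1)
print([h.family for h in after_level2])
```

Passing the earlier level's result as `accepted` is one simple way to make stronger evidence take precedence over weaker evidence; a fuller treatment would also merge same-family fragments separated by an insertion before the overlap check.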