Gaps in structurally similar proteins: towards improvement of multiple sequence alignment.

Author information

Wrabl James O, Grishin Nick V

Affiliation

Howard Hughes Medical Institute, University of Texas Southwestern Medical Center, Dallas, Texas 75390-9050, USA.

Publication information

Proteins. 2004 Jan 1;54(1):71-87. doi: 10.1002/prot.10508.

DOI:10.1002/prot.10508

PMID:14705025

Abstract

An algorithm was developed to locally optimize gaps from the FSSP database. Over 2 million gaps were identified from all versus all FSSP structure comparisons, and datasets of non-identical gaps and flanking regions comprising between 90,000 and 135,000 sequence fragments were extracted for statistical analysis. Relative to background frequencies, gaps were enriched in residue types with small side chains and high turn propensity (D, G, N, P, S), and were depleted in residue types with hydrophobic side chains (C, F, I, L, V, W, Y). In contrast, regions flanking a gap exhibited opposite trends in amino acid frequencies, i.e., enrichment in hydrophobic residues and a high degree of secondary structure. Log-odds scores of residue type as a function of position in or around a gap were derived from the statistics. Three simple experiments demonstrated that these scores contained significant predictive information. First, regions where gaps were observed in single sequences taken from HOMSTRAD structure-based multiple sequence alignments generally scored higher than regions where gaps were not observed. Second, given the correct pairwise-aligned cores, the actual positions of gaps could be reproduced from sequence more accurately using the structurally-derived statistics than by using random pairwise alignments. Finally, revision of the Clustal-W residue-specific gap opening parameters with this new information improved the agreement of Clustal-W alignments with the structure-based alignments. At least three applications for these results are envisioned: improvement of gap penalties in pairwise (or multiple) sequence alignment, prediction of regions of single sequences likely (or unlikely) to contain indels, and more accurate placement of gaps in automated pairwise structure alignment.
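The log-odds scores mentioned in the abstract can be illustrated with a minimal sketch. It assumes that gap-centred sequence fragments have already been extracted and cut to equal length, so that column i of every fragment corresponds to the same position relative to the gap. The function names, the pseudocount, the uniform background, and the toy fragments are all hypothetical and chosen for illustration; they are not the authors' data, parameters, or implementation.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def log_odds_by_position(fragments, background, pseudocount=1.0):
    """Return one dict per column mapping residue -> log2(f_column / f_background).

    fragments   : equal-length strings; column i is position i relative to a gap
    background  : dict residue -> background frequency
    pseudocount : hypothetical smoothing value to avoid log(0) on small samples
    """
    length = len(fragments[0])
    total = len(fragments) + pseudocount * len(AMINO_ACIDS)
    scores = []
    for i in range(length):
        counts = Counter(frag[i] for frag in fragments)
        column = {}
        for aa in AMINO_ACIDS:
            freq = (counts[aa] + pseudocount) / total
            column[aa] = math.log2(freq / background[aa])
        scores.append(column)
    return scores


def score_window(window, scores):
    """Sum the position-specific log-odds over a sequence window."""
    return sum(scores[i][aa] for i, aa in enumerate(window))


# Toy, made-up fragments rich in small / turn-forming residues.
toy_fragments = ["GDN", "GSP", "NGD", "GNS"]
# Uniform background used only for illustration.
uniform_background = {aa: 1.0 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}

toy_scores = log_odds_by_position(toy_fragments, uniform_background)
gap_like = score_window("GDN", toy_scores)
core_like = score_window("LIV", toy_scores)
```

With these toy inputs, a window of small, turn-forming residues scores higher than a hydrophobic one, which mirrors the direction of the trend the abstract reports. Real use would require background frequencies and fragment counts taken from an actual structure-alignment dataset.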


Similar articles

Empirical analysis of protein insertions and deletions determining parameters for the correct placement of gaps in protein sequence alignments.

J Mol Biol. 2004 Aug 6;341(2):617-31. doi: 10.1016/j.jmb.2004.05.045.

A neural network method for prediction of beta-turn types in proteins using evolutionary information.

Bioinformatics. 2004 Nov 1;20(16):2751-8. doi: 10.1093/bioinformatics/bth322. Epub 2004 May 14.

PROMALS: towards accurate multiple sequence alignments of distantly related proteins.

Bioinformatics. 2007 Apr 1;23(7):802-8. doi: 10.1093/bioinformatics/btm017. Epub 2007 Jan 31.

Frequency of gaps observed in a structurally aligned protein pair database suggests a simple gap penalty function.

Nucleic Acids Res. 2004 May 20;32(9):2838-43. doi: 10.1093/nar/gkh610. Print 2004.

Significant improvement in accuracy of multiple protein sequence alignments by iterative refinement as assessed by reference to structural alignments.

J Mol Biol. 1996 Dec 13;264(4):823-38. doi: 10.1006/jmbi.1996.0679.

Variable gap penalty for protein sequence-structure alignment.

Protein Eng Des Sel. 2006 Mar;19(3):129-33. doi: 10.1093/protein/gzj005. Epub 2006 Jan 19.

Contact-based sequence alignment.

Nucleic Acids Res. 2004 Apr 30;32(8):2464-73. doi: 10.1093/nar/gkh566. Print 2004.

Combining the GOR V algorithm with evolutionary information for protein secondary structure prediction from amino acid sequence.

Proteins. 2002 Nov 1;49(2):154-66. doi: 10.1002/prot.10181.

The influence of gapped positions in multiple sequence alignments on secondary structure prediction methods.

Comput Biol Chem. 2004 Dec;28(5-6):351-66. doi: 10.1016/j.compbiolchem.2004.09.005.

Cited by

Faithful Interpretation of Protein Structures through Weighted Persistent Homology Improves Evolutionary Distance Estimation.

Mol Biol Evol. 2025 Feb 3;42(2). doi: 10.1093/molbev/msae271.

PC_ali: a tool for improved multiple alignments and evolutionary inference based on a hybrid protein sequence and structure similarity score.

Bioinformatics. 2023 Nov 1;39(11). doi: 10.1093/bioinformatics/btad630.

Quantification of Inter-Sample Differences in T-Cell Receptor Repertoires Using Sequence-Based Information.

Front Immunol. 2017 Nov 15;8:1500. doi: 10.3389/fimmu.2017.01500. eCollection 2017.

Characterization of parasite-specific indels and their proposed relevance for selective anthelminthic drug targeting.

Infect Genet Evol. 2016 Apr;39:201-211. doi: 10.1016/j.meegid.2016.01.025. Epub 2016 Jan 30.

Measuring guide-tree dependency of inferred gaps in progressive aligners.

Bioinformatics. 2013 Apr 15;29(8):1011-7. doi: 10.1093/bioinformatics/btt095. Epub 2013 Feb 23.

Systematic analysis of short internal indels and their impact on protein folding.

BMC Struct Biol. 2010 Aug 4;10:24. doi: 10.1186/1472-6807-10-24.

New tips for structure prediction by comparative modeling.

Bioinformation. 2009;3(6):263-7. doi: 10.6026/97320630003263. Epub 2009 Jan 12.

The effectiveness of position- and composition-specific gap costs for protein similarity searches.

Bioinformatics. 2008 Jul 1;24(13):i15-23. doi: 10.1093/bioinformatics/btn171.

Aligning sequences by minimum description length.

EURASIP J Bioinform Syst Biol. 2007;2007(1):72936. doi: 10.1155/2007/72936.

DNA indels in coding regions reveal selective constraints on protein evolution in the human lineage.

BMC Evol Biol. 2007 Oct 12;7:191. doi: 10.1186/1471-2148-7-191.
