Improving the alignment quality of consistency based aligners with an evaluation function using synonymous protein words.

Affiliation

Bioinformatics Lab, Institute of Information Science, Academia Sinica, Taipei, Taiwan.

Publication information

PLoS One. 2011;6(12):e27872. doi: 10.1371/journal.pone.0027872. Epub 2011 Dec 2.

DOI: 10.1371/journal.pone.0027872

PMID: 22163274

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC3229492/

Abstract

Most sequence alignment tools can successfully align protein sequences with high levels of sequence identity. The accuracy of the corresponding structure alignment, however, decreases rapidly for distantly related sequences (<20% identity). In this range of identity, alignments optimized to maximize sequence similarity are often inaccurate from a structural point of view. Over the last two decades, most multiple protein aligners have been optimized for their capacity to reproduce structure-based alignments while using only sequence information. Currently available methods differ essentially in how they measure the similarity between aligned residues: substitution matrices, Fourier transform, sophisticated profile-profile functions, or, more recently, consistency-based approaches.

In this paper, we present a flexible similarity measure for residue pairs to improve the quality of protein sequence alignment. Our approach, called SymAlign, relies on the identification of conserved words that are found across a sizeable fraction of the considered dataset and are supported by evolutionary analysis. These words are then used to define a position-specific substitution matrix that better reflects the biological significance of local similarity. Experimental results show that the SymAlign scoring scheme can be incorporated within T-Coffee to improve sequence alignment accuracy. We also demonstrate that SymAlign is less sensitive to the presence of structurally non-similar proteins. In the analysis of the relationship between sequence identity and structure similarity, SymAlign can better differentiate structurally similar proteins from non-similar proteins.

We show that protein sequence alignments can be significantly improved using a similarity estimation based on weighted n-grams. In our analysis of the alignments thus produced, sequence conservation becomes a better indicator of structural similarity. SymAlign also provides alignment visualization that can display sub-optimal alignments on dot-matrices. The visualization makes it easy to identify well-supported alternative alignments that may not have been identified by dynamic programming. SymAlign is available at http://bio-cluster.iis.sinica.edu.tw/SymAlign/.
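The idea of scoring residue pairs from words shared across a dataset can be illustrated with a minimal sketch. This is not SymAlign's implementation: it uses exact n-gram matches only (no synonymous words and no evolutionary analysis), weights each word by the fraction of sequences that contain it, and the function names `conserved_words` and `pair_scores`, the word length `n`, and the threshold `min_fraction` are all assumptions made for illustration.

```python
from collections import defaultdict


def conserved_words(seqs, n=3, min_fraction=0.5):
    """Map each n-gram to the fraction of sequences containing it,
    keeping only n-grams found in at least `min_fraction` of them.
    (Hypothetical helper; exact matches only.)"""
    support = defaultdict(set)
    for idx, seq in enumerate(seqs):
        for i in range(len(seq) - n + 1):
            support[seq[i:i + n]].add(idx)
    total = len(seqs)
    return {
        word: len(ids) / total
        for word, ids in support.items()
        if len(ids) / total >= min_fraction
    }


def pair_scores(a, b, words, n=3):
    """Score residue pairs (i, j) between sequences a and b.
    Every conserved word that occurs in both sequences adds its
    weight to each pair of positions it covers."""
    scores = defaultdict(float)
    for i in range(len(a) - n + 1):
        word = a[i:i + n]
        if word not in words:
            continue
        for j in range(len(b) - n + 1):
            if b[j:j + n] == word:
                for k in range(n):
                    scores[(i + k, j + k)] += words[word]
    return dict(scores)


if __name__ == "__main__":
    # Toy sequences, invented for this example.
    dataset = ["MKVLAT", "GKVLQT", "AAKVLA"]
    words = conserved_words(dataset, n=3, min_fraction=1.0)
    print(words)
    print(pair_scores(dataset[0], dataset[1], words, n=3))
```

The resulting dictionary plays the role of a sparse, pair-specific score table: positions covered by widely shared words receive higher scores than positions that match only by chance, which is the intuition behind deriving a position-specific substitution matrix from conserved words.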

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e7cc/3229492/2a1deea8045d/pone.0027872.g001.jpg
