A statistical score for assessing the quality of multiple sequence alignments.

Author information

Ahola Virpi, Aittokallio Tero, Vihinen Mauno, Uusipaikka Esa

Affiliation

Biotechnology and Food Research, MTT Agrifood Research Finland, Jokioinen, Finland.

Publication information

BMC Bioinformatics. 2006 Nov 3;7:484. doi: 10.1186/1471-2105-7-484.

DOI:10.1186/1471-2105-7-484

PMID:17081313

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC1687212/

Abstract

BACKGROUND

Multiple sequence alignment is the foundation of many important applications in bioinformatics that aim at detecting functionally important regions, predicting protein structures, building phylogenetic trees, etc. Although the automatic construction of a multiple sequence alignment for a set of remotely related sequences is a very challenging and error-prone task, many downstream analyses still rely heavily on the accuracy of the alignments.

RESULTS

To address the need for an objective evaluation framework, we introduce a statistical score that assesses the quality of a given multiple sequence alignment. The quality assessment is based on counting the number of significantly conserved positions in the alignment, using an importance sampling method in conjunction with a statistical profile analysis framework. We first evaluate a novel objective function that the alignment quality score uses to measure positional conservation. The results for the Src homology 2 (SH2) domain, Ras-like proteins, peptidase M13, subtilase and beta-lactamase families demonstrate that the score can distinguish sequence patterns with different degrees of conservation. Second, we evaluate the quality of the alignments produced by several widely used multiple sequence alignment programs using the novel alignment quality score and the commonly used sum-of-pairs method. According to these results, the Mafft strategy L-INS-i outperforms the other methods, although the differences among Probcons, TCoffee and Muscle are mostly insignificant. The novel alignment quality score provides results similar to those of the sum-of-pairs method.
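
The idea of counting significantly conserved positions can be illustrated with a deliberately simplified sketch. It is not the authors' method: it assumes Shannon entropy in place of the paper's objective function, plain Monte Carlo sampling from a uniform amino-acid background in place of importance sampling, and it omits the profile analysis framework entirely. All names (`column_entropy`, `count_conserved_positions`, `alpha`, `n_samples`) are hypothetical.

```python
import math
import random
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
GAP = "-"


def column_entropy(residues):
    """Shannon entropy in bits of a non-empty list of residues."""
    n = len(residues)
    return -sum((k / n) * math.log2(k / n) for k in Counter(residues).values())


def count_conserved_positions(alignment, alpha=0.01, n_samples=2000, seed=0):
    """Count columns whose entropy is significantly low under a uniform null.

    Assumption: the null model draws each residue independently and uniformly
    from the 20 amino acids; a real background would not be uniform.
    """
    rng = random.Random(seed)
    null_by_depth = {}  # number of non-gap residues -> sampled null entropies
    conserved = 0
    for column in zip(*alignment):
        residues = [c for c in column if c != GAP]
        depth = len(residues)
        if depth < 2:
            continue  # too few residues to say anything about conservation
        if depth not in null_by_depth:
            null_by_depth[depth] = [
                column_entropy(rng.choices(AMINO_ACIDS, k=depth))
                for _ in range(n_samples)
            ]
        observed = column_entropy(residues)
        # Empirical p-value: share of null columns at least as conserved.
        hits = sum(h <= observed for h in null_by_depth[depth])
        p_value = (1 + hits) / (n_samples + 1)
        if p_value < alpha:
            conserved += 1
    return conserved


# Toy alignment of ten sequences: columns 1 and 3 are identical in every
# sequence, column 2 holds ten different residues.
alignment = ["W" + AMINO_ACIDS[i] + "G" for i in range(10)]
print(count_conserved_positions(alignment))  # prints 2
```

Plain Monte Carlo needs many samples to resolve very small p-values, which is one reason a method such as importance sampling is attractive for this kind of test.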

CONCLUSION

The results indicate that the proposed statistical score is useful in assessing the quality of multiple sequence alignments.
