PFASUM: a substitution matrix from Pfam structural alignments.

Author information

Keul Frank, Hess Martin, Goesele Michael, Hamacher Kay

Affiliations

Computational Biology and Simulation, Department of Biology, Technische Universität Darmstadt, Schnittspahnstraße 2, Darmstadt, 64287, Germany.

Graphics, Capture and Massively Parallel Computing, Department of Computer Science, Technische Universität Darmstadt, Rundeturmstraße 12, Darmstadt, 64283, Germany.

Publication information

BMC Bioinformatics. 2017 Jun 5;18(1):293. doi: 10.1186/s12859-017-1703-z.

DOI:10.1186/s12859-017-1703-z

PMID:28583067

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC5460430/

Abstract

BACKGROUND

Detecting homologous protein sequences and computing multiple sequence alignments (MSA) are fundamental tasks in molecular bioinformatics. These tasks usually require a substitution matrix for modeling evolutionary substitution events derived from a set of aligned sequences. Over the last years, the known sequence space increased drastically and several publications demonstrated that this can lead to significantly better performing matrices. Interestingly, matrices based on dated sequence datasets are still the de facto standard for both tasks even though their data basis may limit their capabilities. We address these aspects by presenting a new substitution matrix series called PFASUM. These matrices are derived from Pfam seed MSAs using a novel algorithm and thus build upon expert ground truth data covering a large and diverse sequence space.
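As a rough illustration of how a substitution matrix can be derived from aligned sequences, the sketch below computes log-odds scores (in bits) from residue pairs counted in alignment columns. This is the generic BLOSUM-style log-odds idea, not the PFASUM algorithm, which the abstract does not describe; the function name `log_odds_matrix` and the toy alignment are assumptions made for this example, and no sequence clustering or weighting is applied.

```python
import math
from collections import Counter
from itertools import combinations


def log_odds_matrix(alignment):
    """Return {(a, b): score_in_bits} for unordered residue pairs (a <= b).

    `alignment` is a list of equal-length aligned sequences; '-' marks a gap.
    Illustrative only: no clustering, weighting, or pseudocounts.
    """
    pair_counts = Counter()
    for column in zip(*alignment):
        residues = [r for r in column if r != "-"]  # ignore gap characters
        for a, b in combinations(residues, 2):
            pair_counts[tuple(sorted((a, b)))] += 1

    total = sum(pair_counts.values())
    # Observed frequency of each unordered pair.
    q = {pair: n / total for pair, n in pair_counts.items()}

    # Background frequency of each residue implied by the pair frequencies.
    p = Counter()
    for (a, b), freq in q.items():
        if a == b:
            p[a] += freq
        else:
            p[a] += freq / 2
            p[b] += freq / 2

    scores = {}
    for (a, b), freq in q.items():
        # Expected pair frequency if residues were paired independently.
        expected = p[a] ** 2 if a == b else 2 * p[a] * p[b]
        scores[(a, b)] = math.log2(freq / expected)
    return scores


toy_alignment = ["AAV", "AAV", "ALV"]  # hypothetical three-sequence alignment
toy_scores = log_odds_matrix(toy_alignment)
```

A positive score means the pair was observed more often than expected by chance in the input alignment; practical matrices are typically scaled and rounded to integers, which this sketch omits.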

RESULTS

We show results for two use cases: First, we tested the homology search performance of PFASUM matrices on up-to-date ASTRAL databases with varying sequence similarity. Our study shows that the usage of PFASUM matrices can lead to significantly better homology search results when compared to conventional matrices. PFASUM matrices with comparable relative entropies to the commonly used substitution matrices BLOSUM50, BLOSUM62, PAM250, VTML160 and VTML200 outperformed their corresponding counterparts in 93% of all test cases. A general assessment also comparing matrices with different relative entropies showed that PFASUM matrices delivered the best homology search performance in the test set. Second, our results demonstrate that the usage of PFASUM matrices for MSA construction improves their quality when compared to conventional matrices. On up-to-date MSA benchmarks, at least 60% of all MSAs were reconstructed in an equal or higher quality when using MUSCLE with PFASUM31, PFASUM43 and PFASUM60 matrices instead of conventional matrices. This rate even increases to at least 76% for MSAs containing similar sequences.
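The comparison above pairs matrices by relative entropy. A minimal sketch of that quantity, assuming the standard definition as the expected log-odds score under the observed pair distribution, is shown below; the function name `relative_entropy` and the example frequencies are hypothetical and are not taken from the paper.

```python
import math


def relative_entropy(pair_freqs, background):
    """Relative entropy (bits) of a log-odds matrix.

    pair_freqs: {(a, b): q_ab} for unordered pairs, summing to 1.
    background: {a: p_a} residue frequencies, summing to 1.
    Computes H = sum over pairs of q_ab * log2(q_ab / e_ab), where e_ab is
    the pair frequency expected under independence.
    """
    h = 0.0
    for (a, b), q in pair_freqs.items():
        if q == 0:
            continue  # zero-frequency pairs contribute nothing
        expected = background[a] ** 2 if a == b else 2 * background[a] * background[b]
        h += q * math.log2(q / expected)
    return h


# Hypothetical two-letter alphabet with equal background frequencies.
background = {"A": 0.5, "B": 0.5}
independent = {("A", "A"): 0.25, ("A", "B"): 0.5, ("B", "B"): 0.25}
conserved = {("A", "A"): 0.5, ("B", "B"): 0.5}

h_independent = relative_entropy(independent, background)
h_conserved = relative_entropy(conserved, background)
```

When pairs occur exactly as often as chance predicts, the relative entropy is zero; it grows as the observed pairs become more conserved. Matrices with higher relative entropy are generally suited to more similar sequences, which is why matrices are compared at comparable values.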

CONCLUSIONS

We present the novel PFASUM substitution matrices derived from manually curated MSA ground truth data covering the currently known sequence space. Our results imply that PFASUM matrices improve homology search performance as well as MSA quality in many cases when compared to conventional substitution matrices. Hence, we encourage the usage of PFASUM matrices and especially PFASUM60 for these specific tasks.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e686/5460430/483b95703bbe/12859_2017_1703_Fig1_HTML.jpg
