Bridging the gaps in statistical models of protein alignment.

Affiliation

Department of Data Science and Artificial Intelligence, Faculty of Information Technology, Monash University, Clayton, VIC 3800, Australia.

Publication information

Bioinformatics. 2022 Jun 24;38(Suppl 1):i229-i237. doi: 10.1093/bioinformatics/btac246.

Abstract

SUMMARY

Sequences of proteins evolve by accumulating substitutions together with insertions and deletions (indels) of amino acids. However, it remains common practice to treat substitutions and indels separately, inferring an approximate model for each, to quantify sequence relationships. Although this separation is computationally convenient (which remains its primary motivation), there have been few attempts to model the two systematically and jointly. To bridge this gap, this article demonstrates how a complete statistical model quantifying the evolution of pairs of aligned proteins can be constructed using a time-parameterized substitution matrix and a time-parameterized alignment state machine. Methods to derive all parameters of such a model from any benchmark collection of aligned protein sequences are described. This has not only allowed us to generate a unified statistical model for each of nine widely used substitution matrices (PAM, JTT, BLOSUM, JO, WAG, VTML, LG, MIQS and PFASUM), but has also yielded a new unified model, MMLSUM. Our underlying methodology measures the Shannon information content required by each model to losslessly explain any given collection of alignments, which has allowed us to quantify the performance of all the above models on six comprehensive alignment benchmarks. Our results show that MMLSUM achieves a new and clear overall best performance, followed by PFASUM, VTML, BLOSUM and MIQS, in that order, among the top five. We further analyze the statistical properties of the MMLSUM model and contrast it with the others.
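
To make the idea of scoring an alignment by its Shannon information content concrete, the following is a minimal toy sketch and not the authors' implementation. It assumes a hypothetical three-letter alphabet, a symmetric one-step substitution matrix `BASE` raised to an integer time `t`, a uniform prior over symbols, and fixed (not time-dependent) transition probabilities for a three-state match/insert/delete machine. All names and numbers are illustrative assumptions, and the cost of stating the alignment length is omitted.

```python
import math

# Hypothetical toy alphabet; real models use the 20 amino acids.
ALPHABET = "ABC"
INDEX = {c: i for i, c in enumerate(ALPHABET)}

# Assumed one-step stochastic matrix (each row sums to 1).
BASE = [
    [0.98, 0.01, 0.01],
    [0.01, 0.98, 0.01],
    [0.01, 0.01, 0.98],
]

# Assumed uniform prior over symbols (the stationary distribution of BASE).
PRIOR = [1.0 / len(ALPHABET)] * len(ALPHABET)

# Assumed state-machine probabilities: M = match, I = insert, D = delete.
START = {"M": 0.90, "I": 0.05, "D": 0.05}
TRANS = {
    "M": {"M": 0.90, "I": 0.05, "D": 0.05},
    "I": {"M": 0.50, "I": 0.40, "D": 0.10},
    "D": {"M": 0.50, "I": 0.10, "D": 0.40},
}


def mat_mul(a, b):
    """Multiply two square matrices given as lists of lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(m, t):
    """Raise a square matrix to an integer power t >= 1."""
    result = m
    for _ in range(t - 1):
        result = mat_mul(result, m)
    return result


def alignment_bits(seq_a, seq_b, states, t):
    """Message length in bits of an alignment under the toy model.

    `states` is a string over M/I/D: M consumes one symbol from each
    sequence, I consumes one from seq_b, D consumes one from seq_a.
    """
    p_t = mat_pow(BASE, t)
    bits = 0.0
    i = j = 0
    prev = None
    for s in states:
        # Cost of stating the alignment state.
        p_state = START[s] if prev is None else TRANS[prev][s]
        bits -= math.log2(p_state)
        # Cost of stating the symbol(s) emitted in that state.
        if s == "M":
            a, b = INDEX[seq_a[i]], INDEX[seq_b[j]]
            bits -= math.log2(PRIOR[a] * p_t[a][b])
            i += 1
            j += 1
        elif s == "I":
            bits -= math.log2(PRIOR[INDEX[seq_b[j]]])
            j += 1
        else:  # "D"
            bits -= math.log2(PRIOR[INDEX[seq_a[i]]])
            i += 1
        prev = s
    return bits
```

Under these assumptions, a call such as `alignment_bits("ABCA", "ABA", "MMDM", 5)` returns the number of bits the toy model needs to state that alignment. A shorter message for the same alignments indicates a better-fitting model, which is the sense in which models can be compared on a benchmark.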

SUPPLEMENTARY INFORMATION

Supplementary data are available at Bioinformatics online.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/b89e/9235498/bd90e8d796a4/btac246f1.jpg
