The choice of sequence homologs included in multiple sequence alignments has a dramatic impact on evolutionary conservation analysis.

Author information

Department of Systems & Computational Biology, Albert Einstein College of Medicine, Bronx, NY, USA.

Publication information

Bioinformatics. 2019 Jan 1;35(1):12-19. doi: 10.1093/bioinformatics/bty523.

Abstract

MOTIVATION

The analysis of sequence conservation patterns has been widely utilized to identify functionally important (catalytic and ligand-binding) protein residues for over a half-century. Despite decades of development, on average state-of-the-art non-template-based functional residue prediction methods must predict ∼25% of a protein's total residues to correctly identify half of the protein's functional site residues. The overwhelming proportion of false positives results in reported 'F-Scores' of ∼0.3. We investigated the limits of current approaches, focusing on the so-far neglected impact of the specific choice of homologs included in multiple sequence alignments (MSAs).
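To make the F-Score figure concrete, here is a minimal sketch of the arithmetic. It assumes the reported F-Score is the usual F1 measure (harmonic mean of precision and recall), and it uses a hypothetical 300-residue protein with 30 functional residues; these numbers are illustrative and are not taken from the study.

```python
def f_score(n_predicted, n_functional, n_correct):
    """F1 score: harmonic mean of precision and recall."""
    if n_predicted == 0 or n_functional == 0 or n_correct == 0:
        return 0.0
    precision = n_correct / n_predicted   # fraction of predictions that are right
    recall = n_correct / n_functional     # fraction of functional residues found
    return 2 * precision * recall / (precision + recall)

# Hypothetical protein (assumed numbers): 300 residues, 30 of them functional.
# Predicting 25% of all residues (75) and recovering half of the functional
# ones (15) gives precision 0.2 and recall 0.5.
example_f = f_score(n_predicted=75, n_functional=30, n_correct=15)
print(round(example_f, 2))
```

Under these assumed proportions the score comes out close to 0.3, which shows how a recall of one half can still yield a low F-Score when most predictions are false positives.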

RESULTS

The limits of conservation-based functional residue prediction were explored by surveying the binding sites of 1023 proteins. A straightforward conservation analysis of MSAs composed of randomly selected homologs sampled from a PSI-BLAST search achieves average F-Scores of ∼0.3, a performance matching that reported by state-of-the-art methods, which often consider additional features for the prediction in a machine learning setting. Interestingly, we found that a simple combinatorial MSA sampling algorithm will in almost every case produce an MSA with an optimal set of homologs whose conservation analysis reaches average F-Scores of ∼0.6, doubling state-of-the-art performance. We also show that this is nearly at the theoretical limit of possible performance given the agreement between different binding site definitions. Additionally, we showcase the progress in this direction made by Selection of Alignment by Maximal Mutual Information (SAMMI), an information-theory-based approach to identifying biologically informative MSAs. This work highlights the importance and the unused potential of optimally composed MSAs for conservation analysis.
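The idea of searching over homolog subsets can be sketched as follows. This is a toy illustration under stated assumptions, not the authors' algorithm and not SAMMI: conservation is scored with plain Shannon entropy per column, the candidate subsets are enumerated exhaustively with `itertools.combinations` (a real search over PSI-BLAST hits would have to sample instead), and the best subset is picked using the known binding site, so it only illustrates an upper bound. All function names and the toy sequences are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations

def column_entropy(column):
    """Shannon entropy (bits) of one alignment column; lower means more conserved."""
    total = len(column)
    return -sum((c / total) * math.log2(c / total) for c in Counter(column).values())

def predict_conserved(msa, top_k):
    """Indices of the top_k most conserved columns (ties broken by position)."""
    n_cols = len(msa[0])
    entropies = [column_entropy([seq[i] for seq in msa]) for i in range(n_cols)]
    ranked = sorted(range(n_cols), key=lambda i: (entropies[i], i))
    return set(ranked[:top_k])

def f1(predicted, actual):
    """F1 score between predicted and true residue index sets."""
    tp = len(predicted & actual)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(predicted), tp / len(actual)
    return 2 * precision * recall / (precision + recall)

def best_subset(query, homologs, site, subset_size, top_k):
    """Try every homolog subset of the given size; keep the best-scoring one.

    Uses the known site to choose, so this is an oracle-style upper bound.
    """
    best_score, best = -1.0, None
    for subset in combinations(homologs, subset_size):
        score = f1(predict_conserved([query, *subset], top_k), site)
        if score > best_score:
            best_score, best = score, subset
    return best_score, best

# Toy pre-aligned sequences (hypothetical); the "binding site" is columns 0 and 3.
query = "ACDEFG"
homologs = ["AKLEMN", "APQERS", "TCDWFG", "VCDYFG"]
score, chosen = best_subset(query, homologs, site={0, 3}, subset_size=2, top_k=2)
print(score, chosen)
```

In this toy case the first two homologs conserve only the site columns, so an alignment built from them recovers the site exactly, while the last two conserve other columns and would mislead the same analysis.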

SUPPLEMENTARY INFORMATION

Supplementary data are available at Bioinformatics online.
