The choice of sequence homologs included in multiple sequence alignments has a dramatic impact on evolutionary conservation analysis.

Affiliation

Department of Systems & Computational Biology, Albert Einstein College of Medicine, Bronx, NY, USA.

Publication information

Bioinformatics. 2019 Jan 1;35(1):12-19. doi: 10.1093/bioinformatics/bty523.

DOI:10.1093/bioinformatics/bty523

PMID:29947739

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC6298051/

Abstract

MOTIVATION

The analysis of sequence conservation patterns has been widely used for over half a century to identify functionally important (catalytic and ligand-binding) protein residues. Despite decades of development, state-of-the-art non-template-based functional residue prediction methods must, on average, predict ∼25% of a protein's total residues to correctly identify half of its functional site residues. The resulting overwhelming proportion of false positives yields reported 'F-Scores' of ∼0.3. We investigated the limits of current approaches, focusing on the so-far neglected impact of the specific choice of homologs included in multiple sequence alignments (MSAs).
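The link between "predict ∼25% of residues to recover half of the functional residues" and an F-Score of ∼0.3 can be checked with a short calculation. The sketch below assumes the F-Score is the usual harmonic mean of precision and recall (F1), and it assumes a hypothetical protein of 200 residues of which 20 (10%) are functional; neither number comes from the paper, and the result depends on that assumed fraction.

```python
def f_score(true_pos, false_pos, false_neg):
    """F1: harmonic mean of precision and recall."""
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)

# Hypothetical protein (assumed sizes, not taken from the paper).
total_residues = 200
functional = 20                      # assumed: 10% of residues are functional

predicted = total_residues // 4      # method predicts ~25% of residues -> 50
true_pos = functional // 2           # half of the functional residues found -> 10
false_pos = predicted - true_pos     # 40 predicted residues are not functional
false_neg = functional - true_pos    # 10 functional residues are missed

score = f_score(true_pos, false_pos, false_neg)
print(round(score, 2))  # 0.29
```

Under these assumptions precision is 0.2 and recall is 0.5, so the false positives dominate the score even though half of the site is recovered.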

RESULTS

The limits of conservation-based functional residue prediction were explored by surveying the binding sites of 1023 proteins. A straightforward conservation analysis of MSAs composed of randomly selected homologs sampled from a PSI-BLAST search achieves average F-Scores of ∼0.3, a performance matching that reported by state-of-the-art methods, which often consider additional features for the prediction in a machine learning setting. Interestingly, we found that a simple combinatorial MSA sampling algorithm will in almost every case produce an MSA with an optimal set of homologs whose conservation analysis reaches average F-Scores of ∼0.6, doubling state-of-the-art performance. We also show that this is nearly at the theoretical limit of possible performance given the agreement between different binding site definitions. Additionally, we showcase the progress in this direction made by Selection of Alignment by Maximal Mutual Information (SAMMI), an information-theory-based approach to identifying biologically informative MSAs. This work highlights the importance and the unused potential of optimally composed MSAs for conservation analysis.
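As a rough illustration of the idea of sampling MSAs and scoring each by conservation analysis, here is a minimal sketch. It is not the authors' algorithm, whose details the abstract does not give: the entropy-based conservation score, the top-k prediction rule, and all function names (`column_conservation`, `predict_sites`, `best_sampled_msa`) are assumptions made for this example. Selecting the best sample requires the known binding site, so the procedure estimates an upper bound on performance and is not itself a predictor.

```python
import math
import random
from collections import Counter

def column_conservation(msa):
    """Score each column as 1 - normalized Shannon entropy (1.0 = invariant)."""
    scores = []
    for j in range(len(msa[0])):
        counts = Counter(seq[j] for seq in msa)
        total = sum(counts.values())
        entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
        scores.append(1.0 - entropy / math.log2(21))  # 20 amino acids + gap
    return scores

def predict_sites(msa, top_k):
    """Predict the top_k most conserved columns as functional positions."""
    scores = column_conservation(msa)
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return set(ranked[:top_k])

def f_score_sets(predicted, actual):
    """F1 between predicted and known site positions."""
    tp = len(predicted & actual)
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(actual)
    return 2 * precision * recall / (precision + recall)

def best_sampled_msa(query, homologs, actual, subset_size, n_samples, top_k, seed=0):
    """Randomly sample homolog subsets and keep the MSA with the highest F1."""
    rng = random.Random(seed)
    best_score, best_msa = -1.0, None
    for _ in range(n_samples):
        msa = [query] + rng.sample(homologs, subset_size)
        score = f_score_sets(predict_sites(msa, top_k), actual)
        if score > best_score:
            best_score, best_msa = score, msa
    return best_score, best_msa

# Toy pre-aligned sequences (made up for illustration).
query = "ACDEFG"
homologs = ["ACDEYG", "ACNEFG", "TCDQFG", "ACDEFW", "SCDEFG"]
known_site = {1, 2}  # hypothetical binding-site columns

score, msa = best_sampled_msa(query, homologs, known_site,
                              subset_size=3, n_samples=20, top_k=2)
```

Because different subsets of homologs yield different conservation profiles, the best and worst samples can differ widely in F-Score, which is the effect the abstract describes.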

SUPPLEMENTARY INFORMATION

Supplementary data are available at Bioinformatics online.
