Isofunctional Protein Subfamily Detection Using Data Integration and Spectral Clustering.

Authors

Boari de Lima Elisa, Meira Wagner, Melo-Minardi Raquel Cardoso de

Affiliations

Department of Biochemistry and Immunology, Universidade Federal de Minas Gerais, Belo Horizonte, Minas Gerais, Brazil.

Department of Computer Science, Universidade Federal de Minas Gerais, Belo Horizonte, Minas Gerais, Brazil.

Publication information

PLoS Comput Biol. 2016 Jun 27;12(6):e1005001. doi: 10.1371/journal.pcbi.1005001. eCollection 2016 Jun.

DOI:10.1371/journal.pcbi.1005001

PMID:27348631

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC4922564/

Abstract

As more genomes are sequenced, the vast majority of proteins may only be annotated computationally, given that experimental investigation is extremely costly. This highlights the need for computational methods that determine protein function quickly and reliably. We believe that dividing a protein family into subtypes that share specific functions not common to the whole family reduces the complexity of the function annotation problem. Hence, the purpose of this work is to detect isofunctional subfamilies within a family of unknown function while identifying the residues that differentiate them. Similarity between protein pairs according to various properties is interpreted as evidence of functional similarity. Data are integrated using genetic programming and passed to a spectral clustering algorithm, which creates clusters of similar proteins. The proposed framework was applied to well-known protein families and to a family of unknown function, and then compared to ASMC. Our fully automated technique obtained better clusters than ASMC for two families and equivalent results for two others, including one whose clusters had been manually defined. Clusters produced by our framework corresponded closely to the known subfamilies and were more contrasting than those produced by ASMC. Additionally, for the families whose specificity-determining positions are known, those residues were among the ones our technique considered most important for differentiating a given group. When the framework was run on the crotonase and enolase SFLD superfamilies, the results agreed closely with this gold standard. The best results consistently involved multiple data types, confirming our hypothesis that similarities according to different knowledge domains may be used as evidence of functional similarity.
Our main contributions are the proposed strategy for selecting and integrating data types, along with the ability to work with noisy and incomplete data; the use of domain knowledge to detect subfamilies in a family with different specificities, thus reducing the complexity of the experimental function characterization problem; and the identification of residues responsible for specificity.
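
The pipeline the abstract describes (several pairwise similarity matrices, integrated into one affinity matrix, then spectral clustering) can be illustrated with a minimal sketch. It is not the authors' implementation: a fixed weighted average stands in for the genetic-programming integration step, only a two-way split is computed, and the function names (`integrate`, `spectral_bipartition`, `block_similarity`) and the toy similarity values are hypothetical, chosen purely for illustration.

```python
import math

def integrate(matrices, weights):
    """Weighted average of pairwise-similarity matrices.
    ASSUMPTION: a simple stand-in for the genetic-programming
    integration mentioned in the abstract, not the authors' method."""
    n = len(matrices[0])
    total = sum(weights)
    return [[sum(w * m[i][j] for w, m in zip(weights, matrices)) / total
             for j in range(n)] for i in range(n)]

def spectral_bipartition(W, iterations=500):
    """Two-way spectral clustering of a symmetric, non-negative affinity
    matrix W, using the second eigenvector of D^-1/2 W D^-1/2."""
    n = len(W)
    deg = [sum(row) for row in W]
    # Shifted normalized affinity (M + I) / 2: same eigenvectors as M,
    # eigenvalues moved into [0, 1] so power iteration is well behaved.
    M = [[0.5 * (W[i][j] / math.sqrt(deg[i] * deg[j]) + (1.0 if i == j else 0.0))
          for j in range(n)] for i in range(n)]
    # The leading eigenvector is known in closed form: sqrt(deg), normalized.
    norm = math.sqrt(sum(deg))
    top = [math.sqrt(d) / norm for d in deg]
    x = [float(i) for i in range(n)]  # deterministic start vector
    for _ in range(iterations):
        x = [sum(M[i][j] * x[j] for j in range(n)) for i in range(n)]
        # Deflate: remove the component along the leading eigenvector.
        dot = sum(a * b for a, b in zip(x, top))
        x = [a - dot * b for a, b in zip(x, top)]
        length = math.sqrt(sum(a * a for a in x))
        x = [a / length for a in x]
    # Cluster label = sign of the rescaled second eigenvector.
    return [0 if x[i] / math.sqrt(deg[i]) < 0 else 1 for i in range(n)]

def block_similarity(groups, within, between):
    """Toy similarity matrix: `within` inside a group, `between` across groups."""
    n = len(groups)
    return [[1.0 if i == j else (within if groups[i] == groups[j] else between)
             for j in range(n)] for i in range(n)]

# Toy data: six proteins in two hidden subfamilies (values are invented).
hidden = [0, 0, 0, 1, 1, 1]
sequence_sim = block_similarity(hidden, within=0.8, between=0.1)
structure_sim = block_similarity(hidden, within=0.6, between=0.2)

affinity = integrate([sequence_sim, structure_sim], weights=[2.0, 1.0])
labels = spectral_bipartition(affinity)
print(labels)
```

On this toy input the two hidden groups end up in different clusters. Detecting more than two subfamilies would require the general form of spectral clustering (several eigenvectors followed by a clustering step such as k-means), which this sketch does not cover.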

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/ef7f/4922564/4eff6604b788/pcbi.1005001.g001.jpg
