Automated protein subfamily identification and classification.

Authors

Brown Duncan P, Krishnamurthy Nandini, Sjölander Kimmen

Affiliation

Department of Bioengineering, University of California, Berkeley, California, United States of America.

Publication

PLoS Comput Biol. 2007 Aug;3(8):e160. doi: 10.1371/journal.pcbi.0030160.

DOI:10.1371/journal.pcbi.0030160

PMID:17708678

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC1950344/

Abstract

Function prediction by homology is widely used to provide preliminary functional annotations for genes for which experimental evidence of function is unavailable or limited. This approach has been shown to be prone to systematic error, including percolation of annotation errors through sequence databases. Phylogenomic analysis avoids these errors in function prediction but has been difficult to automate for high-throughput application. To address this limitation, we present a computationally efficient pipeline for phylogenomic classification of proteins. This pipeline uses the SCI-PHY (Subfamily Classification in Phylogenomics) algorithm for automatic subfamily identification, followed by subfamily hidden Markov model (HMM) construction. A simple and computationally efficient scoring scheme using family and subfamily HMMs enables classification of novel sequences to protein families and subfamilies. Sequences representing entirely novel subfamilies are differentiated from those that can be classified to subfamilies in the input training set using logistic regression. Subfamily HMM parameters are estimated using an information-sharing protocol, enabling subfamilies containing even a single sequence to benefit from conservation patterns defining the family as a whole or in related subfamilies. SCI-PHY subfamilies correspond closely to functional subtypes defined by experts and to conserved clades found by phylogenetic analysis. Extensive comparisons of subfamily and family HMM performances show that subfamily HMMs dramatically improve the separation between homologous and non-homologous proteins in sequence database searches. Subfamily HMMs also provide extremely high specificity of classification and can be used to predict entirely novel subtypes. The SCI-PHY Web server at http://phylogenomics.berkeley.edu/SCI-PHY/ allows users to upload a multiple sequence alignment for subfamily identification and subfamily HMM construction. Biologists wishing to provide their own subfamily definitions can do so. Source code is available on the Web page. The Berkeley Phylogenomics Group PhyloFacts resource contains pre-calculated subfamily predictions and subfamily HMMs for more than 40,000 protein families and domains at http://phylogenomics.berkeley.edu/phylofacts/.
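
The classification step described in the abstract (score a sequence against family and subfamily HMMs, then use logistic regression to separate known from novel subfamilies) might be sketched roughly as follows. This is only an illustration under stated assumptions, not the SCI-PHY implementation: the abstract does not say which features or coefficients the logistic regression uses, so the score margin, the `family_threshold` cutoff, and the `intercept` and `slope` values here are hypothetical, and the HMM scores are taken as precomputed numbers rather than produced by a real HMM package.

```python
import math
from typing import Dict, Optional, Tuple


def logistic(z: float) -> float:
    """Numerically stable logistic function 1 / (1 + exp(-z))."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def classify(
    family_score: float,
    subfamily_scores: Dict[str, float],
    family_threshold: float = 0.0,  # hypothetical family-membership cutoff
    intercept: float = -2.0,        # hypothetical regression coefficient
    slope: float = 0.5,             # hypothetical regression coefficient
) -> Tuple[bool, Optional[str], float]:
    """Return (is_family_member, subfamily_or_None, p_known).

    A subfamily of None with is_family_member True means the sequence
    looks like a family member that may represent a novel subfamily.
    """
    # Step 1: reject sequences that do not score as family members.
    if family_score < family_threshold or not subfamily_scores:
        return (False, None, 0.0)

    # Step 2: pick the best-scoring subfamily HMM.
    best = max(subfamily_scores, key=lambda name: subfamily_scores[name])

    # Step 3: assumed feature: how much the best subfamily HMM
    # outscores the family HMM.
    margin = subfamily_scores[best] - family_score

    # Step 4: logistic regression turns the feature into a probability
    # that the sequence belongs to a known (training-set) subfamily.
    p_known = logistic(intercept + slope * margin)
    if p_known >= 0.5:
        return (True, best, p_known)
    return (True, None, p_known)
```

With these illustrative coefficients, a sequence whose best subfamily score clearly exceeds its family score is assigned to that subfamily, while a family member with only a small margin is flagged as a candidate novel subfamily. In practice the coefficients would be fitted on labeled training data.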
