Performance of Hidden Markov Models in Recovering the Standard Classification of Glycoside Hydrolases.

Author information

Rossi Mariana Fonseca, Mello Beatriz, Schrago Carlos G

Affiliation

Department of Genetics, Federal University of Rio de Janeiro, Rio de Janeiro, Brazil.

Publication information

Evol Bioinform Online. 2017 Apr 20;13:1176934317703401. doi: 10.1177/1176934317703401. eCollection 2017.

Abstract

Glycoside hydrolases (GHs) are carbohydrate-active enzymes that catalyze the hydrolysis of the glycosidic bonds of complex sugars, releasing simpler carbohydrates. The current standard GH family classification is available in the CAZy database, which is based on amino acid sequence similarity and is curated semi-automatically. However, with the exponential increase in data availability from genome sequences, automated classification methods are required for the fast annotation of coding sequences. Currently, the dbCAN database offers automatic annotation of signature domains from CAZy-defined classifications using a statistical approach, hidden Markov models (HMMs). However, dbCAN does not cover the entire set of CAZy GH families. Moreover, no study has yet evaluated whether HMM profiles are a viable means of automatically assigning GH amino acid sequences to the standard CAZy GH family classification itself. In this work, we performed a meta-analysis in which amino acid sequences from CAZy-defined GH families were used to build family-specific HMM profiles. We then queried a set of ~300 000 GH sequences against our database of HMM profiles estimated from CAZy families, and conducted the same evaluation against the available dbCAN HMM profiles. Our analyses recovered 65% of matches with the standard CAZy classification, whereas the dbCAN HMMs recovered 61%. We also analyzed the types of errors commonly found when HMMs are used to recover CAZy-based classifications. Although the HMMs performed well, further development is necessary before GH classification can be fully automated and standardized among protein databases.
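
The evaluation described in the abstract (query sequences against family profiles, then compare each assignment with the CAZy family) can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the abstract does not state which software, score thresholds, or error categories were used. The function names, the best-hit-by-lowest-E-value rule, the `mismatch` and `no_hit` categories, and the toy data are all assumptions made for illustration.

```python
from collections import Counter


def best_hits(hits):
    """Keep, for each sequence, the family of its lowest-E-value hit.

    `hits` is an iterable of (sequence_id, family, e_value) tuples,
    e.g. parsed from the tabular output of a profile search tool.
    Picking the lowest E-value is an assumed rule, not the paper's.
    """
    best = {}
    for seq_id, family, e_value in hits:
        if seq_id not in best or e_value < best[seq_id][1]:
            best[seq_id] = (family, e_value)
    return {seq_id: family for seq_id, (family, _) in best.items()}


def score_assignments(reference, predicted):
    """Tally one outcome per reference sequence.

    The categories are hypothetical: 'match' (same family as the
    reference), 'mismatch' (different family), 'no_hit' (no profile hit).
    """
    outcomes = Counter()
    for seq_id, ref_family in reference.items():
        pred_family = predicted.get(seq_id)
        if pred_family is None:
            outcomes["no_hit"] += 1
        elif pred_family == ref_family:
            outcomes["match"] += 1
        else:
            outcomes["mismatch"] += 1
    return outcomes


def match_rate(outcomes):
    """Fraction of reference sequences assigned to their reference family."""
    total = sum(outcomes.values())
    return outcomes["match"] / total if total else 0.0


# Toy data invented for illustration; not taken from the study.
reference = {"seq1": "GH1", "seq2": "GH5", "seq3": "GH13", "seq4": "GH13"}
hits = [
    ("seq1", "GH1", 1e-50),
    ("seq1", "GH5", 1e-3),   # weaker secondary hit, discarded
    ("seq2", "GH13", 1e-10),  # best hit disagrees with the reference
    ("seq3", "GH13", 1e-80),
]                             # seq4 has no hit at all

predicted = best_hits(hits)
outcomes = score_assignments(reference, predicted)
print(dict(outcomes), match_rate(outcomes))
```

Keeping disagreements and missing hits in separate counters makes it possible to tell a wrong family assignment apart from a sequence that no profile detected, which is the kind of distinction an error analysis would need.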

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/7b5c/5404901/a9a20dba4b07/10.1177_1176934317703401-fig1.jpg
