Rossi Mariana Fonseca, Mello Beatriz, Schrago Carlos G
Department of Genetics, Federal University of Rio de Janeiro, Rio de Janeiro, Brazil.
Evol Bioinform Online. 2017 Apr 20;13:1176934317703401. doi: 10.1177/1176934317703401. eCollection 2017.
Glycoside hydrolases (GHs) are carbohydrate-active enzymes that catalyze the hydrolysis of the glycosidic bonds of complex sugars, releasing simpler carbohydrates. The current standard GH family classification is available in the CAZy database; it is based on amino acid sequence similarity and is curated semi-automatically. However, with the exponential increase in data from genome sequencing, automated classification methods are required for the fast annotation of coding sequences. Currently, the dbCAN database offers automatic annotation of signature domains from CAZy-defined classifications using a statistical approach, hidden Markov models (HMMs). However, dbCAN does not contain the entire set of CAZy GH families. Moreover, the viability of using HMM profiles to automatically assign GH amino acid sequences to the standard CAZy GH family classification itself has not yet been evaluated. In this work, we performed a meta-analysis in which amino acid sequences from CAZy-defined GH families were used to build family-specific HMM profiles. We then queried a set of ~300 000 GH sequences against our database of HMM profiles estimated from CAZy families, and conducted the same evaluation against the available dbCAN HMM profiles. Our analyses recovered 65% of matches with the standard CAZy classification, whereas the dbCAN HMMs recovered 61%. We also provide an analysis of the types of errors commonly found when HMMs are used to recover CAZy-based classifications. Although the performance of the HMMs was good, further developments are necessary before GH classification can be fully automated and standardized across protein databases.
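The evaluation described above, in which each sequence is assigned to the family of its best-scoring HMM profile and that assignment is compared with its CAZy family, can be sketched as follows. This is only an illustrative sketch, not the authors' pipeline: the function names, the E-value cutoff of 1e-5, the three outcome labels, and the toy sequence identifiers are all assumptions, since the abstract does not state the matching criteria that were used.

```python
from collections import Counter


def best_hits(hits):
    """Keep, for each sequence, the family of its lowest-E-value profile hit."""
    best = {}
    for seq_id, family, evalue in hits:
        if seq_id not in best or evalue < best[seq_id][1]:
            best[seq_id] = (family, evalue)
    return {seq_id: family for seq_id, (family, _) in best.items()}


def evaluate(hits, cazy_labels, max_evalue=1e-5):
    """Tally how HMM-based assignments compare with the CAZy family labels.

    hits        -- iterable of (sequence id, profile family, E-value) tuples
    cazy_labels -- dict mapping sequence id to its CAZy family
    max_evalue  -- hypothetical significance cutoff (an assumption)
    """
    significant = [hit for hit in hits if hit[2] <= max_evalue]
    assigned = best_hits(significant)
    outcome = Counter()
    for seq_id, cazy_family in cazy_labels.items():
        predicted = assigned.get(seq_id)
        if predicted is None:
            outcome["no_hit"] += 1      # no profile hit passed the cutoff
        elif predicted == cazy_family:
            outcome["match"] += 1       # agrees with the CAZy family
        else:
            outcome["mismatch"] += 1    # assigned to a different family
    return outcome


def match_rate(outcome):
    """Fraction of labelled sequences whose assignment agrees with CAZy."""
    total = sum(outcome.values())
    return outcome["match"] / total if total else 0.0


# Toy data: the sequence identifiers and E-values are invented for illustration.
toy_hits = [
    ("seqA", "GH5", 1e-40),
    ("seqA", "GH13", 1e-8),
    ("seqB", "GH13", 1e-20),
    ("seqC", "GH2", 0.5),
]
toy_labels = {"seqA": "GH5", "seqB": "GH1", "seqC": "GH2", "seqD": "GH9"}

toy_outcome = evaluate(toy_hits, toy_labels)
print(dict(toy_outcome), match_rate(toy_outcome))
```

Separating the "no hit" and "mismatch" outcomes is one simple way to begin distinguishing error types; the categories analysed in the paper itself may be defined differently.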