Roche Daniel Barry, Brüls Thomas
Laboratoire de génomique et biochimie du métabolisme, Genoscope, Institut de Génomique, Commissariat à l'Energie Atomique et aux Energies Alternatives, Evry, Essonne, 91057, France; UMR 8030 - Génomique Métabolique, Centre National de la Recherche Scientifique, Evry, Essonne, 91057, France; Département de Biologie, Université d'Evry-Val-d'Essonne, Evry, Essonne, 91000, France; PRES UniverSud Paris, Saint-Aubin, Essonne, 91190, France.
Protein Sci. 2015 May;24(5):643-50. doi: 10.1002/pro.2635. Epub 2015 Jan 28.
As the largest fraction of any proteome does not carry out enzymatic functions, and in order to leverage 3D structural data for the annotation of ever larger volumes of sequence data, we wanted to assess the strength of the link between coarse-grained structural data (i.e., at the homologous superfamily level) and the enzymatic versus non-enzymatic nature of protein sequences. To probe this relationship, we took advantage of 41 phylogenetically diverse genomes (encompassing 11 distinct phyla) recently sequenced within the Genomic Encyclopedia of Bacteria and Archaea (GEBA) initiative, for which we integrated structural information, as defined by CATH, with enzyme-level information, as defined by Enzyme Commission (EC) numbers. This analysis revealed that only a very small fraction (about 1%) of the domain sequences occurring in the analyzed genomes was associated with homologous superfamilies strongly indicative of enzymatic function. Resorting to less stringent criteria to define enzyme-biased versus non-enzyme-biased structural classes, or excluding highly prevalent folds from the analysis, had only a modest effect on this proportion. Thus, the low genomic coverage by structurally anchored protein domains strongly associated with catalytic activities indicates that, on its own, coarse-grained structural information has rather limited power to infer the general property of being an enzyme.
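The coverage figure reported in the abstract can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the input format (one `(superfamily, has_ec)` pair per domain sequence), the function names, the `min_fraction` threshold, and the toy labels `SF_A`, `SF_B`, `SF_C` are all assumptions made for illustration. The abstract does not state the criteria actually used to call a superfamily "strongly indicative" of enzymatic function.

```python
from collections import defaultdict


def superfamily_enzyme_fractions(domains):
    """Return {superfamily: (fraction_with_ec, total_domains)}.

    `domains` is an iterable of (superfamily_id, has_ec) pairs, one per
    domain sequence. This input format is an assumption of the sketch.
    """
    counts = defaultdict(lambda: [0, 0])  # [domains with an EC number, all domains]
    for superfamily, has_ec in domains:
        counts[superfamily][1] += 1
        if has_ec:
            counts[superfamily][0] += 1
    return {sf: (with_ec / total, total) for sf, (with_ec, total) in counts.items()}


def enzyme_biased_coverage(domains, min_fraction=0.95, exclude=frozenset()):
    """Fraction of domain sequences falling in enzyme-biased superfamilies.

    A superfamily counts as enzyme-biased when at least `min_fraction` of
    its domains carry an EC number (hypothetical criterion). Superfamilies
    listed in `exclude` are dropped first, mimicking the removal of highly
    prevalent folds.
    """
    kept = [(sf, has_ec) for sf, has_ec in domains if sf not in exclude]
    if not kept:
        return 0.0
    stats = superfamily_enzyme_fractions(kept)
    biased = {sf for sf, (fraction, _) in stats.items() if fraction >= min_fraction}
    covered = sum(1 for sf, _ in kept if sf in biased)
    return covered / len(kept)


# Toy data with placeholder superfamily labels (not real CATH identifiers).
toy_domains = (
    [("SF_A", True)] * 2
    + [("SF_B", True)] + [("SF_B", False)] * 3
    + [("SF_C", False)] * 4
)

strict = enzyme_biased_coverage(toy_domains)                      # only SF_A qualifies
relaxed = enzyme_biased_coverage(toy_domains, min_fraction=0.25)  # SF_A and SF_B qualify
no_sf_c = enzyme_biased_coverage(toy_domains, exclude={"SF_C"})   # SF_C removed first
```

Varying `min_fraction` and `exclude` corresponds to the two robustness checks mentioned in the abstract: relaxing the bias criterion and excluding highly prevalent folds.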