Sekowska Agnieszka, Danchin Antoine, Risler Jean-Loup
1. Regulation of Gene Expression, Institut Pasteur, 28 rue du Docteur Roux, 75724 Paris Cedex 15, France.
2. Hong Kong University Pasteur Research Centre, Dexter HC Man Building, 8 Sassoon Road, Pokfulam, Hong Kong.
Microbiology (Reading). 2000 Aug;146 (Pt 8):1815-1828. doi: 10.1099/00221287-146-8-1815.
Genome annotation requires explicit identification of gene function. This task frequently relies on protein sequence alignments with proteins of known function. Genetic drift, co-evolution of subunits in protein complexes and a variety of other constraints interfere with the relevance of alignments. Using a specific class of proteins, it is shown that a simple data analysis approach can help solve some of the problems posed. The origin of ureohydrolases has been explored by comparing sequence similarity trees, maximizing amino acid alignment conservation. The trees separate agmatinases from arginases but suggest the presence of unknown biases responsible for the unexpected positions of some enzymes. Using factorial correspondence analysis, a distance tree between sequences was established by comparing the regions of the alignments that contain gaps. The gap tree gives a consistent picture of functional kinship, perhaps reflecting some aspects of phylogeny, with a clear domain of enzymes comprising two types of ureohydrolases (agmatinases and arginases) and activities related to, but different from, ureohydrolases. If the trees were significant, several annotated genes appeared to have been assigned the wrong function. These genes were cloned, and their products were expressed and identified biochemically. This substantiated the validity of the gap tree. Its organization suggests a very ancient origin of ureohydrolases. Some enzymes of eukaryotic origin are spread throughout the arginase part of the trees: they might have been derived from genes of the early symbiotic bacteria that became the organelles, and transferred to the nucleus when symbiotic genes had to escape Muller's ratchet. This work also shows that arginases and agmatinases share the same two manganese-ion-binding sites and exhibit only subtle differences that can be accounted for by the known three-dimensional structure of arginases. In the absence of explicit biochemical data, extreme caution is needed when annotating genes having similarities to ureohydrolases.
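The gap-based comparison described in the abstract can be illustrated with a minimal sketch. It is not the authors' procedure: the abstract does not say how gap regions were coded, so the gap/residue indicator coding, the toy sequences and all function names below are assumptions made for illustration only. The sketch runs a plain correspondence analysis (a singular value decomposition of standardized residuals) on the indicator table and returns Euclidean distances between sequences in the resulting factor space. Such a distance matrix could then be passed to any standard tree-building method.

```python
import numpy as np

# Hypothetical toy alignment (not data from the paper): two sequences share
# one gap pattern, the other two share a different one.
toy = {
    "arg_A": "MK--LVGAHT",
    "arg_B": "MR--LIGAHS",
    "agm_A": "MKTALVG--T",
    "agm_B": "MRSALIG--S",
}

def gap_indicator_matrix(aligned):
    """Code gapped alignment columns as (gap, residue) indicator pairs.

    Assumed coding: only columns holding both gaps and residues are kept,
    so every row and every column of the table has a non-zero sum.
    """
    arr = np.array([list(seq) for seq in aligned])
    is_gap = arr == "-"
    mixed = is_gap.any(axis=0) & (~is_gap).any(axis=0)
    gaps = is_gap[:, mixed].astype(float)
    return np.hstack([gaps, 1.0 - gaps])

def correspondence_coordinates(table, n_axes=2):
    """Principal row coordinates from a correspondence analysis of `table`."""
    P = table / table.sum()                  # correspondence matrix
    r = P.sum(axis=1)                        # row masses
    c = P.sum(axis=0)                        # column masses
    expected = np.outer(r, c)
    S = (P - expected) / np.sqrt(expected)   # standardized residuals
    U, sv, _ = np.linalg.svd(S, full_matrices=False)
    return (U[:, :n_axes] * sv[:n_axes]) / np.sqrt(r)[:, None]

def pairwise_distances(coords):
    """Euclidean distances between rows of `coords`."""
    diff = coords[:, None, :] - coords[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

names = list(toy)
table = gap_indicator_matrix(list(toy.values()))
dist = pairwise_distances(correspondence_coordinates(table))

if __name__ == "__main__":
    for name, row in zip(names, dist):
        print(name, np.round(row, 3))
```

In this toy case, sequences with the same gap pattern fall on the same point of the factor space, while the two patterns are well separated. Real alignments would need a considered choice of which gap regions to include and how many factor axes to retain.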