Department of Biochemistry and Bioinformatics Centre, Indian Institute of Science, Bangalore, India.
PLoS One. 2011;6(10):e27044. doi: 10.1371/journal.pone.0027044. Epub 2011 Oct 31.
Of the ∼4000 ORFs identified in the genome sequence of Mycobacterium tuberculosis (TB) H37Rv, experimentally determined structures are available for 312. Since knowledge of protein structures is essential for a high-resolution understanding of the underlying biology, we sought to obtain a structural annotation of the genome using computational methods. Structural models were obtained and validated for ∼2877 ORFs, covering ∼70% of the genome. Functional annotation of each protein was based on fold-based functional assignments and a novel binding-site-based ligand association. New algorithms for binding site detection and genome-scale binding site comparison at the structural level, recently reported from our laboratory, were utilized. In addition, the annotation covers detection of various sequence and sub-structural motifs, as well as quaternary structure predictions based on the corresponding templates. The study provides an opportunity to obtain a global perspective of the fold distribution in the genome. The annotation indicates that cellular metabolism can be achieved with only 219 folds. New insights into the folds that predominate in the genome, as well as the fold combinations that make up multi-domain proteins, were also obtained. In total, 1728 binding pockets have been associated with ligands through binding site identification and sub-structure similarity analyses. The resource (http://proline.physics.iisc.ernet.in/Tbstructuralannotation), being one of the first to be based on structure-derived functional annotations at a genome scale, is expected to be useful for a better understanding of TB and for application in drug discovery. The reported annotation pipeline is fairly generic and can be applied to other genomes as well.