Deng Lei, Chen Zhigang
IEEE/ACM Trans Comput Biol Bioinform. 2015 Jul-Aug;12(4):902-13. doi: 10.1109/TCBB.2015.2389213.
Structural domains are the evolutionary and functional units of proteins and play a critical role in comparative and functional genomics. Reliable computational assignment of domain function is essential for understanding the function of whole proteins. However, functional annotations are conventionally assigned to full-length proteins rather than to the individual structural domains that carry out specific functions. In this article, we present Structural Domain Annotation (SDA), a novel computational approach for predicting the functions of SCOP structural domains. SDA uses a Bayesian network to integrate heterogeneous information sources: protein-SCOP mapping features based on structure alignment, InterPro2GO mapping information, PSSM profiles, and sequence neighborhood features. By using SDA to annotate SCOP domains with Gene Ontology terms on a large scale, we obtained a database of SCOP domain to Gene Ontology mappings that covers ~162,000 of the approximately 166,900 domains in SCOPe 2.03 (>97 percent), together with their predicted Gene Ontology functions. We benchmarked SDA on a single-domain protein dataset and on an independent dataset from different species. Comparative studies show that SDA significantly outperforms existing function prediction methods for structural domains in terms of coverage and maximum F-measure.
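The abstract reports results in terms of coverage and maximum F-measure but does not define either. The sketch below illustrates one common way such metrics are computed for Gene Ontology predictions, and it is not taken from the paper: coverage is assumed to be the fraction of domains that receive at least one predicted term, and the maximum F-measure is assumed to follow the usual threshold-sweep definition (precision averaged over domains with at least one prediction at the threshold, recall averaged over all evaluated domains). The function names, the score range of 0 to 1, and the toy GO identifiers are all hypothetical.

```python
from typing import Dict, Iterable, Set


def coverage(predictions: Dict[str, Dict[str, float]],
             all_domains: Iterable[str]) -> float:
    """Fraction of domains with at least one predicted term (assumed definition)."""
    domains = list(all_domains)
    if not domains:
        return 0.0
    annotated = sum(1 for d in domains if predictions.get(d))
    return annotated / len(domains)


def max_f_measure(predictions: Dict[str, Dict[str, float]],
                  truth: Dict[str, Set[str]],
                  steps: int = 100) -> float:
    """Maximum F-measure over score thresholds (assumed definition).

    predictions maps domain id -> {GO term: score in [0, 1]}.
    truth maps domain id -> set of true GO terms.
    """
    # Only domains with at least one known term can be evaluated.
    evaluated = {d: terms for d, terms in truth.items() if terms}
    if not evaluated:
        return 0.0

    best = 0.0
    for i in range(1, steps + 1):
        threshold = i / steps
        precisions = []
        recall_sum = 0.0
        for domain, true_terms in evaluated.items():
            scored = predictions.get(domain, {})
            predicted = {t for t, s in scored.items() if s >= threshold}
            hits = len(predicted & true_terms)
            # Precision counts only domains that have a prediction here.
            if predicted:
                precisions.append(hits / len(predicted))
            recall_sum += hits / len(true_terms)
        if not precisions:
            continue
        precision = sum(precisions) / len(precisions)
        recall = recall_sum / len(evaluated)
        if precision + recall > 0:
            best = max(best, 2 * precision * recall / (precision + recall))
    return best


if __name__ == "__main__":
    # Toy data with made-up identifiers, for illustration only.
    toy_predictions = {
        "d1": {"GO:1": 0.9, "GO:2": 0.4},
        "d2": {"GO:3": 0.6},
    }
    toy_truth = {"d1": {"GO:1"}, "d2": {"GO:3", "GO:4"}}
    print(coverage(toy_predictions, ["d1", "d2", "d3"]))
    print(max_f_measure(toy_predictions, toy_truth))
```

Applied to the figures in the abstract, the assumed coverage definition gives 162,000 / 166,900 ≈ 0.971, consistent with the stated ">97 percent".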