Gupta Kshitiz, Sehgal Vivek, Levchenko Andre
The Whitaker Institute for Biomedical Engineering, Johns Hopkins School of Medicine, Baltimore, MD, USA.
BMC Struct Biol. 2008 Oct 3;8:40. doi: 10.1186/1472-6807-8-40.
Predicting the function of a protein from its structure, and vice versa, is only a partially solved problem, and it has largely been treated within biophysics and biochemistry. This underlines the need for computational and bioinformatics approaches. A large body of organized, latent knowledge on protein classification already exists in the form of independently created protein classification databases. By creating probabilistic maps between the classes of structural classification databases (e.g. SCOP) and the classes of functional classification databases (e.g. PROSITE), the structure and function of proteins can be related probabilistically.
We demonstrate that PROSITE and SCOP have significant semantic overlap despite their independent classification schemes. Training SCOP classifiers with PROSITE classes as attributes, and vice versa, improved the accuracy of Support Vector Machine classifiers for both SCOP and PROSITE. Two novel attributes, 2-D elastic profiles and Blocks, were used to reduce time complexity and improve accuracy. Many relationships between classes of SCOP and PROSITE were extracted using decision trees.
We demonstrate that the presented approach can discover new probabilistic relationships between classes of different taxonomies and yield a more accurate classification. Extensive mappings between existing protein classification databases can be created to link the large amount of organized data they hold. Probabilistic maps were created between classes of SCOP and PROSITE, allowing structure to be predicted from function and vice versa. In our experiments, we also found that function is more strongly related to structure than structure is to function.
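As an illustration only, the idea of a probabilistic map between two taxonomies can be sketched as a table of conditional frequencies estimated from proteins annotated in both databases. This is a minimal sketch, not the classifiers or decision trees used in the study: the function name `conditional_map`, the class labels, and the toy annotations below are all hypothetical and chosen purely to show the mechanics.

```python
from collections import Counter, defaultdict


def conditional_map(pairs):
    """Estimate P(target | source) from (source, target) co-annotation pairs."""
    joint = Counter(pairs)
    source_totals = Counter(src for src, _ in pairs)
    table = defaultdict(dict)
    for (src, tgt), count in joint.items():
        table[src][tgt] = count / source_totals[src]
    return dict(table)


# Hypothetical toy data: one (structural class, functional class) pair per protein.
# The labels are placeholders, not real SCOP or PROSITE identifiers.
annotations = [
    ("SCOP_A", "PROSITE_X"),
    ("SCOP_A", "PROSITE_X"),
    ("SCOP_A", "PROSITE_Y"),
    ("SCOP_B", "PROSITE_Y"),
]

# Map in each direction: structure -> function and function -> structure.
structure_to_function = conditional_map(annotations)
function_to_structure = conditional_map([(f, s) for s, f in annotations])

print(structure_to_function)
print(function_to_structure)
```

The two tables are generally not symmetric. In this toy data, knowing the functional class PROSITE_X fixes the structural class, while knowing the structural class SCOP_A leaves the functional class uncertain.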