Yu Xiaojing, Wang Chuan, Li Yixue
Bioinformatics Center, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, 320 Yueyang Road, Shanghai 200031, China.
BMC Bioinformatics. 2006 Apr 4;7:187. doi: 10.1186/1471-2105-7-187.
The number and arrangement of the subunits that form a protein are referred to as its quaternary structure. Quaternary structure is an important protein attribute that is closely related to a protein's function. Proteins with quaternary structure are called oligomeric proteins. Oligomeric proteins are involved in various biological processes, such as metabolism, signal transduction, and chromosome replication. It is therefore highly desirable to develop computational methods that automatically classify the quaternary structure of proteins from their sequences.
To explore this problem, we adopted an approach based on the functional domain composition of proteins. Every protein was represented by a vector calculated from the domains in the PFAM database. The nearest neighbor algorithm (NNA) was used to classify the quaternary structure of proteins from this information. The jackknife cross-validation test was performed on a non-redundant protein dataset in which the sequence identity was less than 25%. The overall success rate obtained was 75.17%. Additionally, to demonstrate the effectiveness of this method, we predicted the quaternary structure of the proteins in an independent dataset and achieved an overall success rate of 84.11%.
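The pipeline described above (domain composition vector, nearest neighbor classification, jackknife test) can be illustrated with a minimal sketch. It is not the authors' implementation: the binary presence/absence encoding, the cosine similarity measure, the domain identifiers (`D1` to `D4`), the class labels, and all function names are assumptions chosen for illustration, since the abstract does not specify them.

```python
import math
from typing import List, Sequence, Set, Tuple

Sample = Tuple[List[int], str]  # (domain composition vector, class label)


def domain_vector(domains: Set[str], vocabulary: Sequence[str]) -> List[int]:
    """Encode a protein as a 0/1 vector over a fixed domain vocabulary.

    Assumption: binary presence/absence encoding; the abstract only says
    the vector is "calculated from the domains".
    """
    return [1 if d in domains else 0 for d in vocabulary]


def similarity(u: Sequence[int], v: Sequence[int]) -> float:
    """Cosine similarity (an assumed measure); 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def nearest_neighbor_label(query: Sequence[int], train: Sequence[Sample]) -> str:
    """Return the label of the training sample most similar to the query."""
    best = max(train, key=lambda sample: similarity(query, sample[0]))
    return best[1]


def jackknife_accuracy(samples: Sequence[Sample]) -> float:
    """Leave-one-out test: classify each sample using all the other samples."""
    correct = 0
    for i, (vector, label) in enumerate(samples):
        rest = list(samples[:i]) + list(samples[i + 1:])
        if nearest_neighbor_label(vector, rest) == label:
            correct += 1
    return correct / len(samples)


# Toy data with hypothetical domain identifiers and labels.
vocabulary = ["D1", "D2", "D3", "D4"]
proteins = [
    ({"D1", "D2"}, "monomer"),
    ({"D1"}, "monomer"),
    ({"D3", "D4"}, "homodimer"),
    ({"D3"}, "homodimer"),
]
samples = [(domain_vector(domains, vocabulary), label) for domains, label in proteins]
accuracy = jackknife_accuracy(samples)
```

In the jackknife test each protein is held out once and classified from all the others, so the reported success rate is the fraction of held-out proteins whose nearest neighbor carries the correct label.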
Comparison with the amino acid composition method and BLAST indicates that the domain composition approach may be a more effective and promising high-throughput method for this complicated problem in bioinformatics.