Joshi Rajani R, Samant Vivekanand V
Department of Mathematics, Indian Institute of Technology Bombay, Powai, Mumbai 400076, India.
J Mol Model. 2006 Sep;12(6):943-52. doi: 10.1007/s00894-006-0116-0. Epub 2006 Apr 29.
Using a data-driven approach, we have found conserved motifs and secondary structural patterns in the vicinity of interior domain boundary points (dbps), without any a priori constraint on the type and number of such features and without any requirement of sequence homology. We have used these motifs and patterns to rerank the solutions obtained by the well-known domain guess by size (DGS) algorithm. Overall, we predict five solutions. Our method, domain boundary prediction using conserved patterns (DPCP), improves the average accuracy of the top five solutions of DGS from 71.74 to 82.88 % for two-continuous-domain proteins, and from 21.38 to 80.56 % for two-discontinuous-domain proteins. Considering only the top solution, accuracy rises from 0 to 72.74 % for two-continuous-domain proteins with chain lengths up to 300 residues, and from 0 to 62.85 % for those with up to 400 residues. For discontinuous domains, the top_min solutions of DPCP (the minimum number of solutions required to predict all dbps of a protein) improve the average accuracy of DGS prediction from 12.5 to 76.3 % for proteins with chain lengths up to 300 residues, and from 13.33 to 70.84 % for proteins with up to 400 residues. In our validation experiments, DPCP also outperformed domain identification from secondary structure element alignment (DomSSEA), the best method reported so far for efficient prediction of domain boundaries using predicted secondary structure. For continuous domains, the average accuracies of the topmost DomSSEA solution are 61 and 52 % for proteins with up to 300 and up to 400 residues, respectively; the corresponding accuracies for the discontinuous case are 28 and 21 %.
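The reranking and top-k evaluation described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify how pattern matches are scored or how a prediction is judged correct, so the `score` function, the helper names, and the ±20-residue tolerance `tol` are all assumptions made for illustration.

```python
from typing import Callable, List, Sequence, Tuple


def rerank(candidates: Sequence[int], score: Callable[[int], float]) -> List[int]:
    """Order candidate boundary positions by descending score.

    `score` is a hypothetical stand-in for a pattern-based score;
    ties keep their original (e.g. DGS) order because the sort is stable.
    """
    return sorted(candidates, key=score, reverse=True)


def top_k_hit(ranked: Sequence[int], true_boundary: int, k: int, tol: int = 20) -> bool:
    """True if any of the first k candidates lies within `tol` residues
    of the true boundary. The tolerance value is an assumption."""
    return any(abs(c - true_boundary) <= tol for c in ranked[:k])


def top_k_accuracy(proteins: Sequence[Tuple[Sequence[int], int]], k: int, tol: int = 20) -> float:
    """Percentage of proteins whose true boundary is hit within the top k.

    Each item in `proteins` is (ranked candidate positions, true boundary).
    """
    hits = sum(top_k_hit(ranked, truth, k, tol) for ranked, truth in proteins)
    return 100.0 * hits / len(proteins)


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    original_order = [50, 120, 200]
    toy_scores = {50: 0.1, 120: 0.9, 200: 0.4}
    reranked = rerank(original_order, lambda pos: toy_scores[pos])
    print(reranked)
    print(top_k_hit(original_order, true_boundary=125, k=1))
    print(top_k_hit(reranked, true_boundary=125, k=1))
```

In this toy case the true boundary is missed by the first candidate in the original order but hit by the first candidate after reranking, which is the kind of top-solution gain the abstract reports.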