Bayesian data mining of protein domains gives an efficient predictive algorithm and new insight.

Authors

Joshi Rajani R, Samant Vivekanand V

Affiliation

Department of Mathematics, Indian Institute of Technology Bombay, Powai, Mumbai, 400 076, India.

Publication information

J Mol Model. 2007 Jan;13(1):275-82. doi: 10.1007/s00894-006-0141-z. Epub 2006 Oct 7.

DOI:10.1007/s00894-006-0141-z

PMID:17028865

Abstract

Identification of structural domains in uncharacterized protein sequences is important for predicting protein tertiary folds and functional sites, and hence for designing biologically active molecules. We present a new predictive computational method that uses Bayesian data mining to classify a protein as having a single domain, two continuous domains, or two discontinuous domains. The algorithm requires only the primary sequence and computer-predicted secondary structure. It incorporates correlation patterns between certain three-dimensional motifs and some local helical folds that are found, with high statistical confidence, to be conserved in the vicinity of protein domains. This computationally simple and fast method predicts domain class with good accuracy: average accuracies are 83.3% for single-domain, 60% for two-continuous-domain, and 65.7% for two-discontinuous-domain proteins. Experiments on a large validation sample show that it performs significantly better than DGS and DomSSEA. The computed Bayesian probabilities reveal important correlations between certain conserved patterns of secondary folds and tertiary motifs, and give new insight. Applications to improving the accuracy of predicted domain boundary points, relevant to protein structural and functional modeling, are also highlighted.
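
The abstract does not specify the features or probability tables the authors use, so the following is only a minimal sketch of how a Bayesian classifier can assign one of the three domain classes from binary sequence-derived features. It assumes conditionally independent features (a naive-Bayes simplification that the paper may not make), and the feature names, priors, and likelihoods are all hypothetical values chosen for illustration, not taken from the paper.

```python
from math import exp, log

# The three domain classes named in the abstract.
CLASSES = ["single", "two_continuous", "two_discontinuous"]


def posterior(priors, likelihoods, observed):
    """Return P(class | observed) by Bayes' rule.

    Assumes binary features that are conditionally independent given
    the class (an illustrative simplification, not the paper's model).
    likelihoods[c][f] is P(feature f present | class c).
    """
    log_scores = {}
    for c in CLASSES:
        score = log(priors[c])
        for feature, present in observed.items():
            p = likelihoods[c][feature]
            score += log(p if present else 1.0 - p)
        log_scores[c] = score
    # Normalise in log space to avoid underflow with many features.
    top = max(log_scores.values())
    unnorm = {c: exp(s - top) for c, s in log_scores.items()}
    total = sum(unnorm.values())
    return {c: v / total for c, v in unnorm.items()}


# Hypothetical numbers, for illustration only.
priors = {"single": 0.5, "two_continuous": 0.3, "two_discontinuous": 0.2}
likelihoods = {
    "single": {"helix_pattern": 0.1, "long_loop": 0.2},
    "two_continuous": {"helix_pattern": 0.6, "long_loop": 0.5},
    "two_discontinuous": {"helix_pattern": 0.7, "long_loop": 0.3},
}
observed = {"helix_pattern": True, "long_loop": False}

post = posterior(priors, likelihoods, observed)
predicted = max(post, key=post.get)
```

The predicted class is the one with the highest posterior probability. In a real setting the likelihood tables would be estimated from a training set of proteins with known domain assignments.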
