Eyal Eran, Frenkel-Morgenstern Milana, Sobolev Vladimir, Pietrokovski Shmuel
Department of Plant Sciences, Weizmann Institute of Science, Rehovot 76100, Israel.
Proteins. 2007 Apr 1;67(1):142-53. doi: 10.1002/prot.21223.
We present a new structurally derived pair-to-pair substitution matrix (P2PMAT). The matrix is constructed from a very large set of integrated high-quality multiple sequence alignments (Blocks) and protein structures. It evaluates the likelihoods of all 160,000 pair-to-pair substitutions (400 × 400 residue pairs). The P2PMAT matrix implicitly accounts for evolutionary conservation, correlated mutations, and residue-residue contact potentials. We demonstrate the usefulness of the matrix for structural prediction. Our method for predicting protein residue-residue contacts from sequence information alone (P2PConPred) is particularly accurate in protein cores, where it performs better than other basic contact prediction methods (increasing accuracy by 25-60%). The method's mean accuracy for protein cores is 24% for 59 diverse families and 34% for a subset of proteins shorter than 100 residues. This is above the level that was recently shown to be sufficient to significantly improve ab initio protein structure prediction. We also demonstrate the ability of our approach to identify native structures within large sets (300-2000) of protein decoys. On the basis of evolutionary information alone, our method ranks the native structure in the top 0.3% of the decoys in 4 of 10 sets and in the top 10% of the decoys in 8 of 10 sets. The method can thus be used to help filter out wrong models, complementing traditional scoring functions.
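As a rough illustration of two quantities mentioned in the abstract, the sketch below counts the pair-to-pair substitutions over the 20 standard amino acids and computes where a native structure falls within a scored decoy set. It is not the authors' implementation: the function names, the use of ordered residue pairs, and the convention that a higher score is better are all assumptions made for this example, and the scores in the usage note are toy values.

```python
from itertools import product

# The 20 standard amino acids (one-letter codes).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# All ordered residue pairs, e.g. ("A", "C"); 20 * 20 = 400 of them.
# Treating pairs as ordered is an assumption that matches the 160,000 total.
PAIRS = list(product(AMINO_ACIDS, repeat=2))


def count_pair_to_pair_substitutions():
    """Number of (residue pair -> residue pair) substitutions: 400 * 400."""
    return len(PAIRS) ** 2


def native_rank_fraction(native_score, decoy_scores):
    """Fraction of decoys scoring at least as well as the native structure.

    Assumes higher scores are better (hypothetical convention). A value of
    0.003 would mean the native ranks in the top 0.3% of the decoy set.
    """
    if not decoy_scores:
        raise ValueError("decoy_scores must not be empty")
    at_least_as_good = sum(1 for s in decoy_scores if s >= native_score)
    return at_least_as_good / len(decoy_scores)
```

For example, with toy scores, `native_rank_fraction(5.0, [1.0, 2.0, 3.0, 6.0])` gives 0.25, meaning one of the four decoys scores at least as well as the native structure.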