Howard Hughes Medical Institute, Department of Biochemistry, and Molecular and Cellular Biology Program, University of Washington, Seattle, WA 98195.
Proc Natl Acad Sci U S A. 2013 Sep 24;110(39):15674-9. doi: 10.1073/pnas.1314045110. Epub 2013 Sep 5.
Recently developed methods have shown considerable promise in predicting residue-residue contacts in protein 3D structures using evolutionary covariance information. However, these methods require large numbers of evolutionarily related sequences to robustly assess the extent of residue covariation, and the larger the protein family, the more likely that contact information is unnecessary because a reasonable model can be built based on the structure of a homolog. Here we describe a method that integrates sequence coevolution and structural context information using a pseudolikelihood approach, allowing more accurate contact predictions from fewer homologous sequences. We rigorously assess the utility of predicted contacts for protein structure prediction using large and representative sequence and structure databases from recent structure prediction experiments. We find that contact predictions are likely to be accurate when the number of aligned sequences (with sequence redundancy reduced to 90%) is greater than five times the length of the protein, and that accurate predictions are likely to be useful for structure modeling if the aligned sequences are more similar to the protein of interest than to the closest homolog of known structure. These conditions are currently met by 422 of the protein families collected in the Pfam database.
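The sequence-count criterion stated above can be sketched in a few lines of Python. This is an illustrative sketch only, not the authors' pipeline: the function names, the greedy filtering strategy, and the column-wise identity measure (which treats matching gap characters as identical) are assumptions introduced here. The only values taken from the abstract are the 90% redundancy level and the factor of five.

```python
from typing import List


def identity(a: str, b: str) -> float:
    """Fraction of identical columns between two equal-length aligned rows.

    Simplification (assumption): gap characters are compared like residues.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    if not a:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / len(a)


def filter_redundancy(rows: List[str], max_identity: float = 0.9) -> List[str]:
    """Greedy filter (hypothetical): keep a row only if it is no more than
    max_identity identical to every row already kept."""
    kept: List[str] = []
    for row in rows:
        if all(identity(row, k) <= max_identity for k in kept):
            kept.append(row)
    return kept


def enough_sequences(n_filtered: int, protein_length: int, factor: float = 5.0) -> bool:
    """Abstract's rule of thumb: filtered sequence count > factor * protein length."""
    return n_filtered > factor * protein_length


if __name__ == "__main__":
    alignment = ["ACDEFGHIKL", "ACDEFGHIKL", "MNPQRSTVWY"]
    filtered = filter_redundancy(alignment)
    print(len(filtered), enough_sequences(len(filtered), protein_length=10))
```

The second condition in the abstract (that the aligned sequences be more similar to the protein of interest than to the closest homolog of known structure) is not sketched here, because the abstract does not say how that similarity is measured.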