MacCallum Robert M
Stockholm Bioinformatics Center, Stockholm University, Stockholm, Sweden.
Bioinformatics. 2004 Aug 4;20 Suppl 1:i224-31. doi: 10.1093/bioinformatics/bth913.
Current approaches to contact map prediction in proteins have focused on amino acid conservation and patterns of mutation at sequentially distant positions. This sequence information is poorly understood and very little progress has been made in this area during recent years.
In this study, an observation of 'striped' sequence patterns across beta-sheets prompted the development of a new type of contact map predictor. Computer program code was evolved with an evolutionary algorithm (genetic programming) to select residues and residue pairs likely to make contacts based solely on local sequence patterns extracted with the help of self-organizing maps. The mean prediction accuracy is 27% on a validation set of 156 domains up to 400 residues in length, where contacts are separated by at least 8 residues and length/10 pairs are predicted. The retrospective accuracy on a set of 15 CASP5 targets is 27% and 14% for length/10 and length/2 predicted pairs, respectively (both using a minimum residue separation of 24). This compares favourably to the equivalent 21% and 13% obtained for the best automated contact prediction methods at CASP5. The results suggest that protein architectures impose regularities in local sequence environments. Other sources of information, such as correlated/compensatory mutations, may further improve accuracy.
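The accuracy figures above depend on three choices: how many pairs are predicted (length/10 or length/2), the minimum sequence separation (8 or 24), and what counts as a correct pair. The following is a minimal sketch of that scoring step, not the author's code. It assumes accuracy means the fraction of the top-ranked predicted pairs that are true contacts, that length/10 is rounded down, and that true contacts are supplied by the caller, since the abstract does not state the distance criterion. The names `contact_accuracy`, `scores` and `true_contacts` are hypothetical.

```python
from typing import Dict, Set, Tuple

Pair = Tuple[int, int]  # residue indices (i, j) with i < j


def contact_accuracy(
    scores: Dict[Pair, float],
    true_contacts: Set[Pair],
    length: int,
    min_sep: int = 8,
    divisor: int = 10,
) -> float:
    """Fraction of the top length/divisor predicted pairs that are true contacts.

    Assumption: only pairs separated by at least `min_sep` residues are
    eligible, and length/divisor is rounded down (at least one pair is kept).
    """
    # Discard pairs that are too close in sequence.
    eligible = [(p, s) for p, s in scores.items() if abs(p[0] - p[1]) >= min_sep]
    if not eligible:
        return 0.0

    # Rank by predicted score, highest first, and keep the top length/divisor.
    n_predict = max(1, length // divisor)
    top = sorted(eligible, key=lambda ps: ps[1], reverse=True)[:n_predict]

    # Count predicted pairs that appear in the set of true contacts.
    hits = sum(1 for (i, j), _ in top if (min(i, j), max(i, j)) in true_contacts)
    return hits / len(top)


# Toy example with made-up scores for a 20-residue chain.
example_scores = {(0, 10): 0.9, (1, 12): 0.8, (2, 5): 0.99, (3, 19): 0.1}
example_contacts = {(0, 10)}
print(contact_accuracy(example_scores, example_contacts, length=20))
```

In the toy example the pair (2, 5) has the highest score but is excluded by the separation filter, which is why the same predictor can report different accuracies at separations of 8 and 24.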
A web-based prediction service is available at http://www.sbc.su.se/~maccallr/contactmaps