Department of Plant Sciences, Weizmann Institute of Science, Rehovot, Israel.
Proteins. 2009 Aug 1;76(2):365-74. doi: 10.1002/prot.22352.
Database-scale analysis was performed to determine whether structural models based on remote homologues are effective in predicting 3D transition metal binding sites in proteins directly from translated gene sequences. Side chain modeling alone is shown to reduce sensitivity and selectivity by <10%. Surprisingly, selectivity did not depend on the level of sequence homology between template and target, or on the presence of a metal ion in the structural template. Applying a modification of the CHED algorithm (Babor et al., Proteins 2008;70:208-217) and machine learning filters, a selectivity of approximately 90% was achieved for protein sequences using unrelated structural templates over a sequence identity range of 18-100%. Below approximately 18% identity, the number of analyzable target-template pairs and the predictability of metal binding sites fall off sharply. A full third of structural templates were found to have target partners only in the remote homology range of 18-30%. In this range, nonmetal-binding templates are calculated to be the majority and serve to predict with 50% sensitivity at the geometric level. Overall, sensitivity at the geometric level for targets having templates in the 18-30% sequence identity range is 73%, with an average of one false positive site per true site. Protein sequences described as "unknown" in the UniProt database, composed largely of unidentified genome project sequences, were studied and their metal binding sites predicted. A web server for prediction of metal binding sites from protein sequence is provided.