University of Trento, Trento.
IEEE/ACM Trans Comput Biol Bioinform. 2012 Jan-Feb;9(1):203-13. doi: 10.1109/TCBB.2011.94. Epub 2011 May 16.
Prediction of binding sites from sequence can significantly help determine the function of uncharacterized proteins on a genomic scale. The task is highly challenging due to the enormous number of alternative candidate configurations. Previous research has only considered this prediction problem starting from 3D information. When starting from sequence alone, only methods that predict the bonding state of selected residues are available. The sole exception consists of pattern-based approaches, which rely on very specific motifs and cannot be applied to discover truly novel sites. We develop new algorithmic ideas based on structured-output learning for determining transition-metal-binding sites coordinated by cysteines and histidines. The inference step (retrieving the best-scoring output) is intractable for general output types (i.e., general graphs). However, under the assumption that no residue can coordinate more than one metal ion, we prove that metal binding has the algebraic structure of a matroid, allowing us to employ a very efficient greedy algorithm. We test our predictor in a highly stringent setting where the training set consists of protein chains belonging to SCOP folds different from the ones used for accuracy estimation. In this setting, our predictor achieves 56 percent precision and 60 percent recall in the identification of ligand-ion bonds.
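To illustrate why a matroid structure makes greedy inference attractive, the following is a minimal sketch and not the predictor described in the abstract. It assumes a purely additive score over candidate residue-ion bonds and models the constraint "no residue coordinates more than one metal ion" as a partition matroid, for which greedy selection maximizes the total score. The function name `greedy_binding`, the residue labels, the ion slot indices, and the score values are all hypothetical.

```python
from typing import Dict, List, Tuple

# A candidate bond: (residue label, metal-ion slot). Both are illustrative.
Edge = Tuple[str, int]


def greedy_binding(scores: Dict[Edge, float]) -> List[Edge]:
    """Greedy maximum-weight independent set in a partition matroid.

    Assumption: the total score is the sum of per-bond scores, and a set
    of bonds is independent when each residue appears at most once.
    """
    chosen: List[Edge] = []
    used_residues = set()
    # Visit candidate bonds from highest to lowest score.
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    for (residue, ion), weight in ranked:
        if weight <= 0:
            break  # remaining bonds cannot raise an additive score
        if residue in used_residues:
            continue  # independence test: residue already bound to an ion
        chosen.append((residue, ion))
        used_residues.add(residue)
    return chosen


# Hypothetical scores for four cysteine/histidine residues and two ion slots.
scores = {
    ("C12", 0): 0.9,
    ("C12", 1): 0.4,
    ("H30", 0): 0.7,
    ("H45", 1): 0.6,
    ("C50", 1): -0.2,
}
result = greedy_binding(scores)
```

Here `C12` is assigned to ion slot 0 only, because its lower-scoring bond to slot 1 would violate the one-ion-per-residue assumption, and `C50` is left unbound because its only candidate bond has a negative score. On a matroid, this kind of greedy selection is optimal for additive weights; the scoring function actually learned by the authors may differ.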