Ghalwash Mohamed F, Dunker A Keith, Obradović Zoran
Center for Data Analytics and Biomedical Informatics, Computer and Information Sciences Department, College of Science and Technology, Temple University, Philadelphia, PA 19122, USA.
Mol Biosyst. 2012 Jan;8(1):381-91. doi: 10.1039/c1mb05373f. Epub 2011 Nov 21.
A grand challenge in the proteomics and structural genomics era is the prediction of protein structure, including the identification of proteins that are partially or wholly unstructured. A number of predictors for identifying intrinsically disordered proteins (IDPs) have been developed over the last decade, but none can be taken as fully reliable on its own. Using a single model for prediction is typically inadequate, because a prediction based only on the most accurate model ignores model uncertainty. In this paper, we present an empirical method to specify and measure the uncertainty associated with disorder predictions. In particular, we analyze the uncertainty in the reference model itself and the uncertainty in the data. This is achieved by training a set of models and developing several meta-predictors on top of them. The best meta-predictor achieved results comparable to or better than those of any single model, suggesting that incorporating different aspects of protein disorder prediction is important for the disorder prediction task. In addition, the best meta-predictor had more balanced sensitivity and specificity than any individual model. We also assessed how disorder predictions change as a function of changes in the protein sequence. For collections of homologous sequences, we found that mutations caused many residues predicted as disordered to flip to predicted order, while the reverse was observed much less frequently. These results suggest that disorder tendencies are more sensitive to allowed mutations than structure tendencies, and that the conservation of disorder is indeed less stable than the conservation of structure.
The five meta-predictors and four single models developed for this study will be freely and publicly accessible for non-commercial use.
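The abstract describes building meta-predictors on top of a set of trained models and comparing their sensitivity and specificity. The following is only a minimal sketch of that general idea, not the method used in the paper: it assumes each model emits a per-residue disorder score in [0, 1], combines the scores by simple averaging (the abstract does not specify the combination rules), and uses the spread across models as a rough proxy for model uncertainty. The function names, the 0.5 threshold, and the toy scores are all hypothetical.

```python
from statistics import mean, pstdev


def meta_predict(scores_by_model, threshold=0.5):
    """Combine per-residue disorder scores from several models.

    Assumption: simple averaging; the paper's actual meta-predictors
    are not described in the abstract.
    Returns (labels, spread), where label 1 means predicted disordered
    and spread is the standard deviation of the scores across models.
    """
    per_residue = list(zip(*scores_by_model))  # one tuple of scores per residue
    averages = [mean(column) for column in per_residue]
    spread = [pstdev(column) for column in per_residue]
    labels = [1 if avg >= threshold else 0 for avg in averages]
    return labels, spread


def sensitivity_specificity(predicted, actual):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p == 1 and a == 1)
    tn = sum(1 for p, a in pairs if p == 0 and a == 0)
    fp = sum(1 for p, a in pairs if p == 1 and a == 0)
    fn = sum(1 for p, a in pairs if p == 0 and a == 1)
    sens = tp / (tp + fn) if (tp + fn) else float("nan")
    spec = tn / (tn + fp) if (tn + fp) else float("nan")
    return sens, spec


# Toy data (hypothetical): three models scoring five residues.
scores_by_model = [
    [0.9, 0.8, 0.4, 0.1, 0.2],
    [0.7, 0.6, 0.6, 0.3, 0.1],
    [0.8, 0.3, 0.7, 0.2, 0.4],
]
actual = [1, 1, 0, 0, 0]  # 1 = disordered, 0 = ordered

labels, spread = meta_predict(scores_by_model)
sens, spec = sensitivity_specificity(labels, actual)
print("labels:", labels)
print("per-residue spread:", [round(s, 3) for s in spread])
print("sensitivity:", sens, "specificity:", round(spec, 3))
```

Residues where the models disagree strongly (large spread) are the ones where relying on a single model would hide the most uncertainty.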