Freyhult Eva, Prusis Peteris, Lapinsh Maris, Wikberg Jarl E S, Moulton Vincent, Gustafsson Mats G
The Linnaeus Centre for Bioinformatics, Uppsala University, Box 598, S-751 24 Uppsala, Sweden.
BMC Bioinformatics. 2005 Mar 10;6:50. doi: 10.1186/1471-2105-6-50.
Proteochemometrics is a new methodology that allows prediction of protein function directly from real interaction measurement data, without the need for 3D structure information. Several reported proteochemometric models of ligand-receptor interactions have already yielded significant insights into various forms of biomolecular interactions. The proteochemometric models are multivariate regression models that predict binding affinity for a particular combination of features of the ligand and protein. Although proteochemometric models have already offered interesting results in various studies, no detailed statistical evaluation of their average predictive power has been performed. In particular, variable subset selection performed to date has always relied on using all available examples, a situation also encountered in microarray gene expression data analysis.
A methodology for unbiased evaluation of the predictive power of proteochemometric models was implemented, and results from applying it to two of the largest proteochemometric data sets yet reported are presented. A double cross-validation (CV) loop procedure is used to estimate the expected performance of a given design method. The unbiased performance estimates (P2) obtained for the data sets considered confirm that properly designed single proteochemometric models have useful predictive power, but that a standard design based on cross-validation may yield models with quite limited performance. The results also show that different commercial software packages employed for the design of proteochemometric models may yield very different, and therefore misleading, performance estimates. In addition, the differences between the models obtained in the double CV loop indicate that detailed chemical interpretation of a single proteochemometric model is uncertain when data sets are small.
The double CV loop employed offers unbiased performance estimates for a given proteochemometric modelling procedure, making it possible to identify cases where the proteochemometric design does not result in useful predictive models. Chemical interpretations of single proteochemometric models are uncertain; interpretation should instead be based on all the models selected in the double CV loop employed here.
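The double CV loop described above can be illustrated with a minimal sketch. It is not the authors' implementation: ridge regression stands in for the proteochemometric regression model, the regularisation strength stands in for the design choices (such as variable subset selection) made in the inner loop, and P2 is assumed to be one minus the ratio of the outer-loop squared prediction error to the total sum of squares. All function names are hypothetical.

```python
import numpy as np

def ridge_fit(X, y, lam):
    # Closed-form ridge regression; the intercept is handled by centering.
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    w = np.linalg.solve(Xc.T @ Xc + lam * np.eye(X.shape[1]), Xc.T @ yc)
    return w, y_mean - x_mean @ w

def ridge_predict(model, X):
    w, b = model
    return X @ w + b

def kfold_indices(n, k, rng):
    # Random partition of range(n) into k roughly equal folds.
    return np.array_split(rng.permutation(n), k)

def select_lambda(X, y, lambdas, k, rng):
    # Inner CV loop: pick the design (here: lambda) with the lowest
    # cross-validated squared error, using only the data passed in.
    folds = kfold_indices(len(y), k, rng)
    best_lam, best_sse = None, np.inf
    for lam in lambdas:
        sse = 0.0
        for test in folds:
            train = np.ones(len(y), dtype=bool)
            train[test] = False
            model = ridge_fit(X[train], y[train], lam)
            sse += np.sum((y[test] - ridge_predict(model, X[test])) ** 2)
        if sse < best_sse:
            best_lam, best_sse = lam, sse
    return best_lam

def double_cv(X, y, lambdas, outer_k=5, inner_k=5, seed=0):
    # Outer CV loop: each outer test fold is never seen by the inner
    # loop that designs the model used to predict it.
    rng = np.random.default_rng(seed)
    pred = np.empty(len(y))
    chosen = []
    for test in kfold_indices(len(y), outer_k, rng):
        train = np.ones(len(y), dtype=bool)
        train[test] = False
        lam = select_lambda(X[train], y[train], lambdas, inner_k, rng)
        model = ridge_fit(X[train], y[train], lam)
        pred[test] = ridge_predict(model, X[test])
        chosen.append(lam)
    # Assumed definition of P2 (see lead-in).
    p2 = 1.0 - np.sum((y - pred) ** 2) / np.sum((y - y.mean()) ** 2)
    return p2, chosen
```

Calling `double_cv(X, y, [0.01, 0.1, 1.0, 10.0])` returns the performance estimate together with the design selected in each outer fold. Inspecting how much the entries of `chosen` (or the fitted models) vary across folds is one way to see why interpreting a single model can be unreliable on small data sets.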