NMR Center, Ruđer Bošković Institute, Bijenička cesta 54, HR-10002, Zagreb, Croatia.
Adv Exp Med Biol. 2011;696:279-84. doi: 10.1007/978-1-4419-7046-6_28.
Testing a bioinformatics algorithm on its training set is a poor indicator of its future performance, because the results are misleadingly optimistic. The preferred approach is to calculate the error rate on an independent dataset (test set). We tested the validity of the FOLD-RATE method for predicting protein folding rate constants [ln(k_f)] using sequences, structural class information, and experimentally verified folding rate constants from the Protein Folding Database (PFD). PFD is a publicly accessible repository of thermodynamic and kinetic data, standardized by the International Foldeomics Consortium, that is of interest to researchers from different backgrounds. Our results show that when the standardized PFD dataset is used to test a protein folding rate prediction method, the estimate of its validity may differ significantly from one obtained on the training data.
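The gap between training-set and independent-test-set error can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the data are synthetic (not PFD entries), and the predictor is a simple 1-nearest-neighbour stand-in, not the FOLD-RATE method. The helper names `rmse`, `nearest_neighbour_predict`, and `make_synthetic` are hypothetical.

```python
import math
import random


def rmse(predicted, observed):
    """Root-mean-square error between two equally long sequences."""
    squared = [(p - o) ** 2 for p, o in zip(predicted, observed)]
    return math.sqrt(sum(squared) / len(observed))


def nearest_neighbour_predict(train_x, train_y, x):
    """Return the target of the training point whose feature is closest to x."""
    idx = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return train_y[idx]


def make_synthetic(n, rng):
    """Synthetic (feature, target) pairs: linear trend plus Gaussian noise.

    These values are invented for the example and are NOT folding data.
    """
    xs = [rng.uniform(0.0, 10.0) for _ in range(n)]
    ys = [2.0 - 0.5 * x + rng.gauss(0.0, 1.0) for x in xs]
    return xs, ys


rng = random.Random(0)  # fixed seed so the run is reproducible
train_x, train_y = make_synthetic(40, rng)
test_x, test_y = make_synthetic(40, rng)  # independent draw, never seen in training

# Error on the data the predictor has memorised.
train_pred = [nearest_neighbour_predict(train_x, train_y, x) for x in train_x]
train_rmse = rmse(train_pred, train_y)

# Error on the independent test set.
test_pred = [nearest_neighbour_predict(train_x, train_y, x) for x in test_x]
test_rmse = rmse(test_pred, test_y)

print(f"training-set RMSE: {train_rmse:.3f}")
print(f"test-set RMSE:     {test_rmse:.3f}")
```

Because the nearest-neighbour predictor reproduces every training target exactly, its training-set error is zero, while its error on the independently drawn test set is not. A real evaluation would replace the synthetic pairs with curated sequences and measured ln(k_f) values.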