Smit Suzanne, van Breemen Mariëlle J, Hoefsloot Huub C J, Smilde Age K, Aerts Johannes M F G, de Koster Chris G
Swammerdam Institute for Life Sciences, Universiteit van Amsterdam, Nieuwe Achtergracht 166, 1018 WV Amsterdam, The Netherlands.
Anal Chim Acta. 2007 Jun 5;592(2):210-7. doi: 10.1016/j.aca.2007.04.043. Epub 2007 Apr 27.
A strategy is presented for the statistical validation of discrimination models in proteomics studies. Several existing tools are combined to form a solid statistical basis for biomarker discovery, which should precede the biochemical validation of any biomarker. These tools are permutation tests and single and double cross-validation. The cross-validation steps combine readily with a new variable selection method called rank products. The strategy is especially suited to the undersampled case, in which samples are few relative to variables, as is often encountered in proteomics and metabolomics studies. Principal component discriminant analysis is used as the classification method; however, the methodology can be used with any classifier. A dataset containing serum samples from Gaucher patients and healthy controls serves as a test case. Double cross-validation shows that the sensitivity of the model is 89% and the specificity 90%. Putative biomarkers are identified using the novel variable selection method. Results from permutation tests support the choice of double cross-validation as the tool for determining error rates when the modelling procedure involves a tuneable parameter. This shows that even cross-validation does not guarantee unbiased results. Validating discrimination models with a combination of permutation tests and double cross-validation helps to avoid erroneous results that may arise from undersampling.
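The validation strategy described in the abstract can be illustrated with a minimal sketch, written under several assumptions that are not taken from the paper. A nearest-centroid classifier stands in for principal component discriminant analysis (the abstract notes that any classifier can be used), the tuneable parameter is the number of retained variables `k`, variables are scored by the absolute difference in class means, and the data are synthetic. The `rank_products` function is a simplified reading of the idea (geometric mean of a variable's rank across outer training sets) and may differ from the authors' definition. All function names are hypothetical.

```python
import math
import random

def mean(values):
    values = list(values)
    return sum(values) / len(values)

def rank_variables(X, y):
    """Order variables by absolute difference in class means, best first (assumed score)."""
    def score(j):
        a = mean(row[j] for row, lab in zip(X, y) if lab == 1)
        b = mean(row[j] for row, lab in zip(X, y) if lab == 0)
        return abs(a - b)
    return sorted(range(len(X[0])), key=score, reverse=True)

def fit(X, y, k):
    """Nearest-centroid classifier on the k top-ranked variables (stand-in for PCDA)."""
    cols = rank_variables(X, y)[:k]
    centroids = {lab: [mean(r[j] for r, l in zip(X, y) if l == lab) for j in cols]
                 for lab in (0, 1)}
    return cols, centroids

def predict(model, row):
    cols, centroids = model
    return min((0, 1), key=lambda lab: sum((row[j] - c) ** 2
                                           for j, c in zip(cols, centroids[lab])))

def stratified_folds(y, n_folds, rng):
    """Deal shuffled sample indices into folds, class by class, so every fold holds both classes."""
    folds, pos = [[] for _ in range(n_folds)], 0
    for lab in (0, 1):
        idx = [i for i, l in enumerate(y) if l == lab]
        rng.shuffle(idx)
        for i in idx:
            folds[pos % n_folds].append(i)
            pos += 1
    return folds

def split(X, y, test):
    """Return the training rows and labels that remain after holding out `test`."""
    held_out = set(test)
    train = [i for i in range(len(y)) if i not in held_out]
    return [X[i] for i in train], [y[i] for i in train]

def cv_error(X, y, k, n_folds, rng):
    """Single cross-validation error for a fixed k."""
    errors = 0
    for test in stratified_folds(y, n_folds, rng):
        model = fit(*split(X, y, test), k)
        errors += sum(predict(model, X[i]) != y[i] for i in test)
    return errors / len(y)

def double_cv(X, y, k_grid, rng, outer=5, inner=4):
    """Outer loop estimates the error; inner loop picks k using training data only."""
    errors, rankings = 0, []
    for test in stratified_folds(y, outer, rng):
        X_train, y_train = split(X, y, test)
        best_k = min(k_grid, key=lambda k: cv_error(X_train, y_train, k, inner, rng))
        model = fit(X_train, y_train, best_k)
        errors += sum(predict(model, X[i]) != y[i] for i in test)
        rankings.append(rank_variables(X_train, y_train))
    return errors / len(y), rankings

def rank_products(rankings):
    """Sort variables by the geometric mean of their ranks across the outer training sets."""
    n_vars = len(rankings[0])
    log_sum = [0.0] * n_vars
    for ranking in rankings:
        for position, j in enumerate(ranking, start=1):
            log_sum[j] += math.log(position)
    return sorted(range(n_vars), key=lambda j: math.exp(log_sum[j] / len(rankings)))

def permutation_p_value(X, y, k_grid, rng, n_perm=20):
    """Repeat the whole double cross-validation on shuffled labels to get a null distribution."""
    observed, _ = double_cv(X, y, k_grid, rng)
    as_good = 0
    for _ in range(n_perm):
        y_perm = y[:]
        rng.shuffle(y_perm)
        as_good += double_cv(X, y_perm, k_grid, rng)[0] <= observed
    return observed, (as_good + 1) / (n_perm + 1)

def make_data(rng, n_per_class=20, n_vars=50, n_informative=3, shift=3.0):
    """Synthetic undersampled data: few samples, many variables, a few of them informative."""
    X, y = [], []
    for lab in (0, 1):
        for _ in range(n_per_class):
            row = [rng.gauss(0.0, 1.0) for _ in range(n_vars)]
            for j in range(n_informative):
                row[j] += shift * lab
            X.append(row)
            y.append(lab)
    return X, y

if __name__ == "__main__":
    rng = random.Random(0)
    X, y = make_data(rng)
    error, rankings = double_cv(X, y, k_grid=(1, 3, 10), rng=rng)
    print("double CV error:", error)
    print("top variables by rank product:", rank_products(rankings)[:5])
    print("observed error, permutation p-value:", permutation_p_value(X, y, (1, 3, 10), rng))
```

In this sketch the inner loop never sees the outer test samples, so the choice of `k` cannot leak into the reported error rate. Choosing `k` by single cross-validation and then reporting that same cross-validated error would reuse the test samples for tuning, which is the kind of bias the abstract attributes to procedures with a tuneable parameter.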