Rubingh Carina M, Bijlsma Sabina, Derks Eduard P P A, Bobeldijk Ivana, Verheij Elwin R, Kochhar Sunil, Smilde Age K
Business Unit Analytical Sciences, TNO Quality of Life, P.O. Box 360, 3700 AJ Zeist, The Netherlands.
BioAnalytical Science Department, Nestlé Research Center, P.O. Box 44, CH-1000 Lausanne 26, Switzerland.
Metabolomics. 2006;2(2):53-61. doi: 10.1007/s11306-006-0022-6. Epub 2006 Jul 11.
Statistical model validation tools such as cross-validation, jack-knifing of model parameters and permutation tests are meant to give an objective assessment of the performance and stability of a statistical model. However, little is known about how these tools perform on megavariate data sets, in which, for instance, the number of variables is more than 10 times the number of subjects. The performance is assessed here for megavariate metabolomics data, but the conclusions also carry over to proteomics, transcriptomics and many other research areas. Partial least squares discriminant analysis (PLS-DA) models were built for several LC-MS lipidomic training data sets containing various numbers of lean and obese subjects. The training data sets were compared on their modelling performance and their predictive ability using 10-fold cross-validation, a permutation test, and test data sets. A wide range of cross-validation error rates was found (from 7.5% to 16.3% for the largest training set and from 0% to 60% for the smallest training set), and the error rate increased as the number of subjects decreased. The test error rates varied from 5% to 50%. The smaller the number of subjects relative to the number of variables, the less the outcome of validation tools such as cross-validation, jack-knifing of model parameters and permutation tests can be trusted. The result depends crucially on the specific sample of subjects used for modelling. The validation tools cannot be used as a warning mechanism for problems due to sample size or to the representativeness of the sampling.
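The validation scheme described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the data are synthetic Gaussian noise with a few shifted variables, PLS-DA is approximated by a NIPALS PLS1 regression on a 0/1 class label thresholded at 0.5, and the function names (`pls1_fit`, `pls_da_predict`, `cv_error`, `permutation_p_value`, `simulate`), the two latent components and the data dimensions are all assumptions chosen for illustration.

```python
import numpy as np

def pls1_fit(X, y, n_comp=2):
    """Fit PLS1 by NIPALS; y is coded 0/1 (hypothetical PLS-DA stand-in)."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xk, yk = X - x_mean, y - y_mean
    W, P, q = [], [], []
    for _ in range(n_comp):
        w = Xk.T @ yk
        norm = np.linalg.norm(w)
        if norm < 1e-12:              # no covariance left to model
            break
        w = w / norm
        t = Xk @ w                    # scores
        tt = t @ t
        p = Xk.T @ t / tt             # X loadings
        c = (yk @ t) / tt             # y loading
        Xk = Xk - np.outer(t, p)      # deflate X and y
        yk = yk - c * t
        W.append(w); P.append(p); q.append(c)
    if not W:
        return x_mean, y_mean, np.zeros(X.shape[1])
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    B = W @ np.linalg.solve(P.T @ W, q)   # regression coefficients
    return x_mean, y_mean, B

def pls_da_predict(model, X):
    """Assign class 1 when the predicted response exceeds 0.5."""
    x_mean, y_mean, B = model
    return ((X - x_mean) @ B + y_mean > 0.5).astype(int)

def cv_error(X, y, rng, n_folds=10, n_comp=2):
    """Misclassification rate from one random k-fold cross-validation."""
    idx = rng.permutation(len(y))
    wrong = 0
    for fold in np.array_split(idx, n_folds):
        train = np.setdiff1d(idx, fold)
        model = pls1_fit(X[train], y[train], n_comp)
        wrong += np.sum(pls_da_predict(model, X[fold]) != y[fold])
    return wrong / len(y)

def permutation_p_value(X, y, rng, n_perm=50, **kw):
    """Share of label permutations whose CV error is <= the observed one."""
    observed = cv_error(X, y, rng, **kw)
    perm = [cv_error(X, rng.permutation(y), rng, **kw) for _ in range(n_perm)]
    p = (1 + sum(e <= observed for e in perm)) / (n_perm + 1)
    return observed, p

def simulate(n, p, rng, n_informative=10, shift=1.0):
    """Synthetic two-class data: n subjects, p variables (assumed design)."""
    y = np.repeat([0, 1], n // 2)
    X = rng.normal(size=(len(y), p))
    X[y == 1, :n_informative] += shift
    return X, y

rng = np.random.default_rng(0)
for n in (20, 80):
    # Draw ten different samples of subjects and record each CV error rate.
    errors = []
    for _ in range(10):
        X, y = simulate(n, 500, rng)
        errors.append(cv_error(X, y, rng))
    print(f"n={n}: CV error from {min(errors):.3f} to {max(errors):.3f}")
```

Drawing a fresh sample of subjects in each repetition mirrors the point of the abstract: the spread of the cross-validation error across samples, rather than any single value, is what shows how far the estimate can be trusted. The exact numbers printed depend entirely on the assumed synthetic design and say nothing about the lipidomics data in the study.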