Smit Suzanne, Hoefsloot Huub C J, Smilde Age K
Swammerdam Institute for Life Sciences, Universiteit van Amsterdam - Nieuwe Achtergracht 166, 1018 WV Amsterdam, The Netherlands.
J Chromatogr B Analyt Technol Biomed Life Sci. 2008 Apr 15;866(1-2):77-88. doi: 10.1016/j.jchromb.2007.10.042. Epub 2007 Nov 4.
This review discusses data analysis strategies for the discovery of biomarkers in clinical proteomics. Proteomics studies produce large amounts of data, characterized by few samples on which many variables are measured. A wealth of classification methods exists for extracting information from the data. Feature selection plays an important role in reducing the dimensionality of the data prior to classification and in discovering biomarker leads. The question of which classification strategy works best remains unanswered. Validation is a crucial step in taking biomarker leads towards clinical use. Here we discuss only statistical validation, while recognizing that biological and clinical validation is of utmost importance. First, there is the need for validated model selection to develop a generalized classifier that predicts new samples correctly. A cross-validation loop that is wrapped around the model development procedure assesses the performance using unseen data. The significance of the model should be tested; we use permutations of the data for comparison with uninformative data. This procedure also tests the correctness of the performance validation. Preferably, a new set of samples is measured to test the classifier and rule out results specific to a machine, analyst, laboratory or the first set of samples. This is not yet standard practice. We present a modular framework that combines feature selection, classification, biomarker discovery and statistical validation; these data analysis aspects are all discussed in this review. The feature selection, classification and biomarker discovery modules can be incorporated or omitted according to the preference of the researcher. The validation modules, however, should not be optional. In each module, the researcher can select from a wide range of methods, since there is no single way that leads to the correct model and proper validation. We discuss many possibilities for feature selection, classification and biomarker discovery.
For validation we advise a combination of cross-validation and permutation testing, a validation strategy supported in the literature.
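The validation strategy summarized above (a cross-validation loop wrapped around the whole model development procedure, followed by a permutation test) can be illustrated with a minimal sketch. Everything specific in it is an assumption made for illustration and is not taken from the review: the univariate feature score, the nearest-centroid classifier, the synthetic data, and all function names (`select_features`, `cv_accuracy`, `permutation_test`, `make_data`) are hypothetical stand-ins for whichever modules a researcher would actually choose.

```python
import random

def _mean(v):
    return sum(v) / len(v)

def _sd(v):
    m = _mean(v)
    return (sum((x - m) ** 2 for x in v) / len(v)) ** 0.5

def select_features(X, y, k):
    """Return indices of the k features with the largest scaled class-mean difference."""
    scores = []
    for j in range(len(X[0])):
        a = [row[j] for row, lab in zip(X, y) if lab == 0]
        b = [row[j] for row, lab in zip(X, y) if lab == 1]
        scores.append(abs(_mean(a) - _mean(b)) / (_sd(a) + _sd(b) + 1e-12))
    return sorted(range(len(scores)), key=lambda j: -scores[j])[:k]

def fit_centroids(X, y, feats):
    """Illustrative classifier (an assumption): one centroid per class on the selected features."""
    return {lab: [_mean([row[j] for row, l in zip(X, y) if l == lab]) for j in feats]
            for lab in (0, 1)}

def predict(centroids, row, feats):
    def dist(c):
        return sum((row[j] - cj) ** 2 for j, cj in zip(feats, c))
    return min(centroids, key=lambda lab: dist(centroids[lab]))

def cv_accuracy(X, y, k_feats=5, n_folds=5, seed=0):
    """Cross-validated accuracy; feature selection is redone inside every fold."""
    idx = list(range(len(y)))
    random.Random(seed).shuffle(idx)
    correct = 0
    for f in range(n_folds):
        test = idx[f::n_folds]
        held_out = set(test)
        train = [i for i in idx if i not in held_out]
        X_tr, y_tr = [X[i] for i in train], [y[i] for i in train]
        feats = select_features(X_tr, y_tr, k_feats)  # training samples only
        cents = fit_centroids(X_tr, y_tr, feats)
        correct += sum(predict(cents, X[i], feats) == y[i] for i in test)
    return correct / len(y)

def permutation_test(X, y, n_perm=100, perm_seed=0, **cv_kwargs):
    """Compare the observed CV accuracy with accuracies obtained on permuted labels."""
    rng = random.Random(perm_seed)
    observed = cv_accuracy(X, y, **cv_kwargs)
    null = []
    for _ in range(n_perm):
        y_perm = list(y)       # copy, so the caller's labels stay untouched
        rng.shuffle(y_perm)    # breaks any link between data and class labels
        null.append(cv_accuracy(X, y_perm, **cv_kwargs))
    p_value = (1 + sum(a >= observed for a in null)) / (n_perm + 1)
    return observed, null, p_value

def make_data(n=40, p=200, n_informative=3, shift=2.0, seed=1):
    """Synthetic 'few samples, many variables' data (purely illustrative)."""
    rng = random.Random(seed)
    y = [i % 2 for i in range(n)]
    X = [[rng.gauss(0, 1) + (shift if j < n_informative and lab == 1 else 0.0)
          for j in range(p)] for lab in y]
    return X, y

if __name__ == "__main__":
    X, y = make_data()
    observed, null, p_value = permutation_test(X, y, n_perm=100)
    print(f"CV accuracy: {observed:.2f}")
    print(f"mean accuracy on permuted labels: {_mean(null):.2f}")
    print(f"permutation p-value: {p_value:.3f}")
```

Two design points in this sketch mirror the abstract. Feature selection runs inside each fold, so the held-out samples stay unseen by the entire model development procedure. The permuted-label runs reuse exactly the same pipeline, so an accuracy well above chance on permuted labels would point to a flaw in the performance validation itself.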