Department of Preventive Medicine and Community Health, University of Texas Medical Branch (UTMB), Galveston, TX, USA.
Methods. 2013 May 15;61(1):73-85. doi: 10.1016/j.ymeth.2013.01.002. Epub 2013 Jan 12.
Biological experiments in the post-genome era can generate a staggering amount of complex data that challenges experimentalists to extract meaningful information. Increasingly, the success of an appropriately controlled experiment relies on a robust data analysis pipeline. In this paper, we present a structured approach to the analysis of multidimensional data that relies on a close, two-way communication between the bioinformatician and experimentalist. A sequential approach employing data exploration (visualization, graphical and analytical study), pre-processing, feature reduction and supervised classification using machine learning is presented. This standardized approach is illustrated by an example from a proteomic data analysis that has been used to predict the risk of infectious disease outcome. Strategies for model selection and post hoc model diagnostics are presented and applied to the case illustration. We discuss some of the practical lessons we have learned applying supervised classification to multidimensional data sets, one of which is the importance of feature reduction in achieving optimal modeling performance.
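The sequence named in the abstract (pre-processing, feature reduction, supervised classification, then model assessment) can be sketched on toy data. This is a minimal illustration and not the authors' pipeline. The abstract does not say which scaling method, feature-ranking score, classifier or validation scheme the paper uses, so the z-score scaling, mean-difference ranking, nearest-centroid classifier and leave-one-out cross-validation below are assumptions chosen for brevity, and all function names and the synthetic data are hypothetical. One design choice worth noting is that scaling and feature selection are refit inside every cross-validation fold, so the held-out sample never influences which features are kept.

```python
import random
import statistics


def standardize(train, test):
    """Z-score each feature using the training-set mean and SD only."""
    cols = list(zip(*train))
    means = [statistics.fmean(c) for c in cols]
    sds = [statistics.pstdev(c) or 1.0 for c in cols]  # guard constant columns

    def scale(rows):
        return [[(v - m) / s for v, m, s in zip(r, means, sds)] for r in rows]

    return scale(train), scale(test)


def select_features(X, y, k):
    """Feature reduction (assumed method): keep the k features with the
    largest absolute difference between the two class means."""
    scores = []
    for j in range(len(X[0])):
        a = [r[j] for r, lab in zip(X, y) if lab == 0]
        b = [r[j] for r, lab in zip(X, y) if lab == 1]
        scores.append((abs(statistics.fmean(a) - statistics.fmean(b)), j))
    return sorted(j for _, j in sorted(scores, reverse=True)[:k])


def nearest_centroid_predict(X, y, x_new):
    """Supervised classification (assumed method): assign x_new to the
    class whose training centroid is closest in squared Euclidean distance."""
    best_label, best_dist = None, float("inf")
    for lab in sorted(set(y)):
        rows = [r for r, l in zip(X, y) if l == lab]
        centroid = [statistics.fmean(c) for c in zip(*rows)]
        dist = sum((u - v) ** 2 for u, v in zip(x_new, centroid))
        if dist < best_dist:
            best_label, best_dist = lab, dist
    return best_label


def loocv_accuracy(X, y, k):
    """Leave-one-out cross-validation; scaling and feature selection
    are refit on the training part of every fold."""
    correct = 0
    for i in range(len(X)):
        X_tr, y_tr = X[:i] + X[i + 1:], y[:i] + y[i + 1:]
        X_tr_s, X_te_s = standardize(X_tr, [X[i]])
        keep = select_features(X_tr_s, y_tr, k)
        X_tr_k = [[r[j] for j in keep] for r in X_tr_s]
        x_te_k = [X_te_s[0][j] for j in keep]
        correct += nearest_centroid_predict(X_tr_k, y_tr, x_te_k) == y[i]
    return correct / len(X)


def make_toy_data(n_per_class=20, n_features=30, n_informative=3, seed=0):
    """Synthetic two-class data: only the first n_informative features
    differ between classes; the rest are pure noise."""
    rng = random.Random(seed)
    X, y = [], []
    for lab in (0, 1):
        for _ in range(n_per_class):
            row = [rng.gauss(0, 1) for _ in range(n_features)]
            for j in range(n_informative):
                row[j] += 3.0 * lab
            X.append(row)
            y.append(lab)
    return X, y


X, y = make_toy_data()
acc = loocv_accuracy(X, y, k=3)
print(f"LOOCV accuracy with 3 selected features: {acc:.2f}")
```

Changing `k` in `loocv_accuracy` gives a quick way to see how the number of retained features affects cross-validated accuracy on the toy data.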