Enot David P, Beckmann Manfred, Overy David, Draper John
Institute of Biological Sciences, University of Wales, Aberystwyth SY23 3DA, United Kingdom.
Proc Natl Acad Sci U S A. 2006 Oct 3;103(40):14865-70. doi: 10.1073/pnas.0605152103. Epub 2006 Sep 21.
Powerful algorithms are required to deal with the dimensionality of metabolomics data. Although many achieve high classification accuracy, the models they generate have limited value unless it can be demonstrated that they are reproducible and statistically relevant to the biological problem under investigation. Random forest (RF) generates models, without any requirement for dimensionality reduction or feature selection, in which individual variables are ranked for significance and displayed in an explicit manner. In metabolome fingerprinting by mass spectrometry, each metabolite can be represented by signals at several m/z. Exploiting a prior understanding of expected biochemical differences between sample classes, we aimed to develop meaningful metrics relevant to the significance both of the overall RF model and individual, potentially explanatory, signals. Pair-wise comparison of related plant genotypes with strong phenotypic differences demonstrated that robust models are not only reproducible but also logically structured, highlighting correlated m/z derived from just a small number of explanatory metabolites reflecting the biological differences between sample classes. RF models were also generated by using groupings of samples known to be increasingly phenotypically similar. Although classification accuracy was often reasonable, we demonstrated reproducibly in both Arabidopsis and potato a performance threshold based on margin statistics beyond which such models showed little structure indicative of either generalizability or further biological interpretability. In a multiclass problem using 25 Arabidopsis genotypes, despite the complicating effects of ecotype background and secondary metabolome perturbations common to several mutations, the ranking of metabolome signals by RF provided scope for deeper interpretability.
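The abstract refers to a performance threshold based on margin statistics but does not define them. As a minimal sketch, assuming the conventional random forest margin (the fraction of trees voting for the true class minus the largest fraction voting for any other class), the per-sample margin and its mean could be computed as below. The function names `rf_margins` and `mean_margin` and the toy vote lists are hypothetical, chosen for illustration only; they are not taken from the paper.

```python
from collections import Counter


def rf_margins(votes, true_labels):
    """Per-sample margins from per-tree votes.

    votes: list of lists; votes[i] holds the class label predicted for
           sample i by each tree (ideally only trees where i was out-of-bag).
    true_labels: list with the true class of each sample.
    Assumed definition: margin = P(true class) - max P(any other class),
    so values lie in [-1, 1] and negative values mean misclassification.
    """
    margins = []
    for tree_votes, truth in zip(votes, true_labels):
        counts = Counter(tree_votes)
        n_trees = len(tree_votes)
        p_true = counts.get(truth, 0) / n_trees
        # Largest vote share among the classes other than the true one.
        p_other = max(
            (c for label, c in counts.items() if label != truth), default=0
        ) / n_trees
        margins.append(p_true - p_other)
    return margins


def mean_margin(votes, true_labels):
    """Average margin over all samples, a single summary of model confidence."""
    margins = rf_margins(votes, true_labels)
    return sum(margins) / len(margins)


# Toy example: three samples, four trees each (illustrative data only).
example_votes = [
    ["a", "a", "a", "b"],  # confidently correct for true class "a"
    ["a", "b", "b", "b"],  # confidently correct for true class "b"
    ["a", "a", "b", "b"],  # tied vote for true class "a"
]
example_truth = ["a", "b", "a"]
example_margins = rf_margins(example_votes, example_truth)
```

A model can reach reasonable accuracy while its margins sit close to zero, which is why a margin-based summary may separate robust models from weakly structured ones more clearly than accuracy alone.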