van Reenen Mari, Westerhuis Johan A, Reinecke Carolus J, Venter J Hendrik
Centre for Human Metabolomics, Faculty of Natural Sciences, North-West University (Potchefstroom Campus), Private Bag X6001, Potchefstroom, South Africa.
Biosystems Data Analysis, Swammerdam Institute for Life Sciences, University of Amsterdam, Science Park 904, 1098 XH, Amsterdam, The Netherlands.
BMC Bioinformatics. 2017 Feb 2;18(1):83. doi: 10.1186/s12859-017-1480-8.
ERp is a variable selection and classification method for metabolomics data. ERp uses minimized classification error rates, based on data from a control group and an experimental group, to test the null hypothesis of no difference between the distributions of variables over the two groups. If the associated p-values are significant, they indicate discriminatory variables (i.e. informative metabolites). The p-values are calculated assuming a common, continuous, strictly increasing cumulative distribution under the null hypothesis. This assumption is violated when zero-valued observations can occur with positive probability, a characteristic of GC-MS metabolomics data, which disqualifies ERp in this context. This paper extends ERp to address two sources of zero-valued observations: (i) zeros reflecting the complete absence of a metabolite from a sample (true zeros); and (ii) zeros reflecting a measurement below the detection limit. This is achieved by allowing the null cumulative distribution function to take the form of a mixture between a jump at zero and a continuous strictly increasing function. The extended ERp approach is referred to as XERp.
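The mixture form of the null distribution described above can be written out as follows. The symbols are assumptions introduced here for illustration and are not taken from the paper: \pi denotes the probability of a zero-valued observation, and G a continuous, strictly increasing cumulative distribution function on the positive values with G(0) = 0.

```latex
F_0(x) =
\begin{cases}
0, & x < 0, \\
\pi + (1 - \pi)\, G(x), & x \ge 0,
\end{cases}
\qquad 0 \le \pi < 1 .
```

In this notation, \pi = 0 gives back a purely continuous null distribution, which is the setting ERp assumes.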
XERp is no longer non-parametric, but its null distributions depend only on one parameter, the true proportion of zeros. Under the null hypothesis this parameter can be estimated by the proportion of zeros in the available data. XERp is shown to perform well with regard to bias and power. To demonstrate the utility of XERp, it is applied to GC-MS data from a metabolomics study on tuberculosis meningitis in infants and children. We find that XERp is able to provide an informative shortlist of discriminatory variables, while attaining satisfactory classification accuracy for new subjects in a leave-one-out cross-validation context.
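As a minimal sketch of the estimation step, the single parameter can be estimated by pooling both groups and taking the fraction of zeros, which is reasonable only under the null hypothesis of a common distribution. The function names and the sampler below are hypothetical and are not the authors' implementation. The sampler draws the non-zero part from a uniform distribution purely as a placeholder for an unspecified continuous distribution.

```python
import random


def pooled_zero_proportion(control, experimental):
    """Estimate the zero proportion from both groups pooled together.

    Pooling is justified only under the null hypothesis that the two
    groups share one distribution.
    """
    pooled = list(control) + list(experimental)
    return sum(1 for x in pooled if x == 0) / len(pooled)


def sample_null(n, pi_zero, rng):
    """Draw n values from a zero-inflated mixture (illustrative only).

    Each value is 0 with probability pi_zero; otherwise it is drawn from
    the interval (0, 1], a stand-in for any continuous distribution on
    the positive values.
    """
    return [0.0 if rng.random() < pi_zero else 1.0 - rng.random()
            for _ in range(n)]


if __name__ == "__main__":
    control = [0, 1.2, 0, 3.4]
    experimental = [0, 2.2, 5.1, 0.7]
    pi_hat = pooled_zero_proportion(control, experimental)
    print(pi_hat)
    print(sample_null(5, pi_hat, random.Random(1)))
```

Data simulated this way could feed a Monte Carlo approximation of a null distribution, which would be a substitute for, not a reproduction of, the null distributions derived in the paper.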
XERp takes into account the distributional structure of data with a probability mass at zero without requiring any knowledge of the detection limit of the metabolomics platform. XERp is able to identify variables that discriminate between two groups by simultaneously extracting information from the difference in the proportion of zeros and shifts in the distributions of the non-zero observations. XERp uses simple rules to classify new subjects and a weight pair to adjust for unequal sample sizes or sensitivity and specificity requirements.
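One way to picture a simple classification rule with a weight pair is a single cut-off on one variable, chosen to minimize a weighted sum of the two groups' misclassification rates. The sketch below assumes that form. The function names, the default weights and the tie-breaking are hypothetical, and the paper's actual rules may differ. A cut-off at zero separates zero from non-zero observations, so the same search can pick up a difference in zero proportions as well as a shift in the non-zero values.

```python
def fit_threshold_rule(control, experimental, w_control=0.5, w_experimental=0.5):
    """Return (error, cutoff, direction) minimizing the weighted error rate.

    direction "up":   classify as experimental if x > cutoff
    direction "down": classify as experimental if x <= cutoff
    """
    candidates = sorted(set(control) | set(experimental))
    best = None
    for c in candidates:
        # Fraction of controls above c and of experimentals at or below c.
        ctrl_above = sum(x > c for x in control) / len(control)
        exp_below = sum(x <= c for x in experimental) / len(experimental)
        for direction, err in (
            ("up", w_control * ctrl_above + w_experimental * exp_below),
            ("down", w_control * (1 - ctrl_above)
                     + w_experimental * (1 - exp_below)),
        ):
            # Keep the first rule that achieves the smallest error so far.
            if best is None or err < best[0]:
                best = (err, c, direction)
    return best


def classify(x, cutoff, direction):
    """Apply a fitted rule to a new observation."""
    is_above = x > cutoff
    return "experimental" if is_above == (direction == "up") else "control"


if __name__ == "__main__":
    control = [0, 0, 0, 1.0]
    experimental = [0, 2.0, 3.0, 4.0]
    err, cutoff, direction = fit_threshold_rule(control, experimental)
    print(err, cutoff, direction)
    print(classify(2.5, cutoff, direction))
```

Raising one weight relative to the other penalizes errors in that group more heavily, which is one way to trade sensitivity against specificity or to compensate for unequal group sizes.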