Variable selection for binary classification using error rate p-values applied to metabolomics data.

Author information

van Reenen Mari, Reinecke Carolus J, Westerhuis Johan A, Venter J Hendrik

Affiliations

Centre for Human Metabolomics, Faculty of Natural Sciences, North-West University (Potchefstroom Campus), Private Bag X6001, Potchefstroom, South Africa.

Department of Statistics, Faculty of Natural Sciences, North-West University (Potchefstroom Campus), Private Bag X6001, Potchefstroom, South Africa.

Publication information

BMC Bioinformatics. 2016 Jan 14;17:33. doi: 10.1186/s12859-015-0867-7.

DOI:10.1186/s12859-015-0867-7

PMID:26763892

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC4712617/

Abstract

BACKGROUND

Metabolomics datasets are often high-dimensional though only a limited number of variables are expected to be informative given a specific research question. The important task of selecting informative variables can therefore become complex. In this paper we look at discriminating between two groups. Two tasks need to be performed: (i) finding variables which differ between the two groups; and (ii) determining how the selected variables can be used to classify new subjects. We introduce an approach using minimum classification error rates as test statistics to find discriminatory and therefore informative variables. The thresholds resulting in the minimum error rates can be used to classify new subjects. This approach transforms error rates into p-values and is referred to as ERp.
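The idea of using a minimum classification error rate as a test statistic can be sketched for a single variable as follows. This is a minimal illustration, not the authors' implementation: the abstract does not state how the null distribution is obtained, so the sketch assumes a label-permutation test as one common non-parametric choice. The function names `min_error_rate` and `permutation_p_value`, the candidate-threshold grid, and the example data are all hypothetical.

```python
import random

def min_error_rate(values, labels):
    """Return (error_rate, threshold, direction) with the lowest error rate.

    direction = +1: classify as group 1 when value > threshold.
    direction = -1: classify as group 1 when value <= threshold.
    """
    n = len(values)
    xs = sorted(set(values))
    # Candidate thresholds: one below the minimum, plus every observed value.
    candidates = [xs[0] - 1.0] + xs
    best = (1.0, None, 0)
    for t in candidates:
        # Number of errors when group 1 is predicted for value > t.
        err_up = sum((v > t) != (g == 1) for v, g in zip(values, labels))
        # Reversing the direction flips every prediction, so errors = n - err_up.
        for direction, err in ((+1, err_up), (-1, n - err_up)):
            rate = err / n
            if rate < best[0]:
                best = (rate, t, direction)
    return best

def permutation_p_value(values, labels, n_perm=2000, seed=0):
    """Assumed null: group labels are exchangeable (label permutation)."""
    rng = random.Random(seed)
    observed, threshold, direction = min_error_rate(values, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        # Count permutations whose minimum error is at least as small.
        if min_error_rate(values, shuffled)[0] <= observed:
            hits += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    p = (hits + 1) / (n_perm + 1)
    return observed, threshold, direction, p

# Made-up data: five controls (0) and five cases (1), clearly shifted upward.
values = [1.0, 1.2, 1.4, 1.6, 1.8, 3.0, 3.2, 3.4, 3.6, 3.8]
labels = [0] * 5 + [1] * 5
error, threshold, direction, p = permutation_p_value(values, labels)
```

The returned direction corresponds to the shift direction mentioned in the abstract: it is chosen by the data when unknown, and the inner loop could be restricted to a single direction when the shift is known in advance.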

RESULTS

We show that non-parametric hypothesis testing, based on minimum classification error rates as test statistics, can find statistically significantly shifted variables. The discriminatory ability of variables becomes more apparent when error rates are evaluated based on their corresponding p-values, as relatively high error rates can still be statistically significant. ERp can handle unequal and small group sizes, as well as account for the cost of misclassification. ERp retains (if known) or reveals (if unknown) the shift direction, aiding in biological interpretation. The threshold resulting in the minimum error rate can immediately be used to classify new subjects. We use NMR generated metabolomics data to illustrate how ERp is able to discriminate subjects diagnosed with Mycobacterium tuberculosis infected meningitis from a control group. The list of discriminatory variables produced by ERp contains all biologically relevant variables with appropriate shift directions discussed in the original paper from which this data is taken.
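Two of the points above, classifying new subjects from a stored threshold and accounting for misclassification cost with unequal group sizes, can be illustrated with a short sketch. The abstract does not specify how costs or group sizes enter the error rate, so the weighting below (per-group error rates combined by relative cost) is an assumption made purely for illustration. The names `classify`, `weighted_error`, `cost_fn` and `cost_fp`, and the data, are hypothetical.

```python
def classify(value, threshold, direction):
    """Assign group 1 or 0 using a stored threshold and shift direction.

    direction = +1: group 1 when value > threshold.
    direction = -1: group 1 when value <= threshold.
    """
    above = value > threshold
    return int(above) if direction == +1 else int(not above)

def weighted_error(values, labels, threshold, direction,
                   cost_fn=1.0, cost_fp=1.0):
    """Assumed cost-weighted error rate.

    False negatives and false positives are each divided by the size of
    their own group, so a large control group does not dominate, and are
    then combined according to the relative costs.
    """
    n1 = sum(1 for g in labels if g == 1)
    n0 = len(labels) - n1
    fn = sum(1 for v, g in zip(values, labels)
             if g == 1 and classify(v, threshold, direction) == 0)
    fp = sum(1 for v, g in zip(values, labels)
             if g == 0 and classify(v, threshold, direction) == 1)
    return (cost_fn * fn / n1 + cost_fp * fp / n0) / (cost_fn + cost_fp)

# Made-up data with unequal groups: eight controls (0) and three cases (1).
values = [1.0, 1.2, 1.4, 1.6, 1.8, 2.0, 2.2, 2.4, 2.3, 3.0, 3.2]
labels = [0] * 8 + [1] * 3

equal_cost = weighted_error(values, labels, threshold=2.2, direction=+1)
# Treat a missed case as three times as costly as a false alarm.
case_heavy = weighted_error(values, labels, threshold=2.2, direction=+1,
                            cost_fn=3.0, cost_fp=1.0)

# Classify a new subject with the stored threshold and direction.
new_subject_group = classify(2.5, threshold=2.2, direction=+1)
```

With this threshold the only error is one control above 2.2, so raising the cost of false negatives lowers the weighted error; a threshold search that minimises this quantity would therefore favour cut-offs that miss fewer cases.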

CONCLUSIONS

ERp performs variable selection and classification, is non-parametric and aids biological interpretation while handling unequal group sizes and misclassification costs. All this is achieved by a single approach which is easy to perform and interpret. ERp has the potential to address many other characteristics of metabolomics data. Future research aims to extend ERp to account for a large proportion of observations below the detection limit, as well as expand on interactions between variables.




