Evaluation of Classifier Performance for Multiclass Phenotype Discrimination in Untargeted Metabolomics.

Author information

Trainor Patrick J, DeFilippis Andrew P, Rai Shesh N

Affiliations

Division of Cardiovascular Medicine, Department of Medicine, University of Louisville, 580 S. Preston St., Louisville, KY 40202, USA.

Department of Bioinformatics and Biostatistics, University of Louisville, 505 S. Hancock St., Louisville, KY 40202, USA.

Publication information

Metabolites. 2017 Jun 21;7(2):30. doi: 10.3390/metabo7020030.

Abstract

Statistical classification is a critical component of utilizing metabolomics data for examining the molecular determinants of phenotypes. Despite this, a comprehensive and rigorous evaluation of the accuracy of classification techniques for phenotype discrimination given metabolomics data has not been conducted. We conducted such an evaluation using both simulated and real metabolomics datasets, comparing Partial Least Squares-Discriminant Analysis (PLS-DA), Sparse PLS-DA (sPLS-DA), Random Forests, Support Vector Machines (SVM), Artificial Neural Networks, k-Nearest Neighbors (k-NN), and Naïve Bayes classification techniques for discrimination. We evaluated the techniques on simulated data generated to mimic global untargeted metabolomics data, incorporating realistic block-wise correlation and partial correlation structures that reflect the correlations and metabolite clustering generated by biological processes. Over the simulation studies, covariance structures, means, and effect sizes were stochastically varied to provide consistent estimates of classifier performance over a wide range of possible scenarios. The effects of non-normal error distributions, the introduction of biological and technical outliers, unbalanced phenotype allocation, missing values due to abundances below a limit of detection, and prior-significance filtering (dimension reduction) were evaluated via simulation. In each simulation, classifier parameters, such as the number of hidden nodes in a Neural Network, were optimized by cross-validation to minimize the probability of detecting spurious results due to poorly tuned classifiers. Classifier performance was then evaluated using real metabolomics datasets of varying sample medium, sample size, and experimental design.
We report that in the most realistic simulation studies, which incorporated non-normal error distributions, unbalanced phenotype allocation, outliers, missing values, and dimension reduction, classifier performance (least to greatest error) was ranked as follows: SVM, Random Forest, Naïve Bayes, sPLS-DA, Neural Networks, PLS-DA, and k-NN classifiers. When non-normal error distributions were introduced, the performance of PLS-DA and k-NN classifiers deteriorated further relative to the remaining techniques. Over the real datasets, SVM and Random Forest classifiers tended to perform better than the other techniques.
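
The simulation design described in the abstract can be illustrated with a small sketch. It is not the authors' code: it assumes NumPy, uses a simple block-diagonal covariance in place of the paper's stochastically varied covariance and partial correlation structures, and tunes only a k-NN classifier (chosen here because it is easy to write without extra libraries, not because the paper favors it). All function names, the block sizes, the effect size, the 10% limit-of-detection quantile, and the choice to replace censored values with the limit itself are hypothetical.

```python
import numpy as np

def simulate_blocks(n_per_class, n_blocks, block_size, rho, effect, rng):
    """Simulate abundances with block-wise correlated metabolites.

    Phenotype c gets a mean shift (hypothetical effect size) in block c,
    so n_blocks must be at least the number of phenotypes.
    """
    p = n_blocks * block_size
    # Correlation rho inside a block, zero between blocks.
    block = np.full((block_size, block_size), rho)
    np.fill_diagonal(block, 1.0)
    cov = np.kron(np.eye(n_blocks), block)
    X, y = [], []
    for c, n in enumerate(n_per_class):
        mean = np.zeros(p)
        mean[c * block_size:(c + 1) * block_size] = effect
        X.append(rng.multivariate_normal(mean, cov, size=n))
        y.append(np.full(n, c))
    return np.vstack(X), np.concatenate(y)

def censor_below_lod(X, quantile):
    """Treat values under a per-metabolite limit of detection as missing,
    then replace them with the limit (an assumed, simple imputation)."""
    lod = np.quantile(X, quantile, axis=0)
    below = X < lod
    return np.where(below, lod, X), below

def knn_predict(X_train, y_train, X_test, k):
    """Plain k-NN with squared Euclidean distance and majority vote."""
    d = ((X_test[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2)
    nearest = np.argsort(d, axis=1)[:, :k]
    return np.array([np.bincount(v).argmax() for v in y_train[nearest]])

def cv_error(X, y, k, n_folds, rng):
    """Mean misclassification rate over n_folds cross-validation folds."""
    idx = rng.permutation(len(y))
    errors = []
    for fold in np.array_split(idx, n_folds):
        train = np.setdiff1d(idx, fold)
        pred = knn_predict(X[train], y[train], X[fold], k)
        errors.append(np.mean(pred != y[fold]))
    return float(np.mean(errors))

rng = np.random.default_rng(0)
# Unbalanced allocation across three phenotypes (40 / 25 / 15 samples).
X, y = simulate_blocks([40, 25, 15], n_blocks=5, block_size=4,
                       rho=0.6, effect=1.5, rng=rng)
X_cens, below = censor_below_lod(X, quantile=0.10)

# Reuse the same fold seed for every k so candidates see identical splits.
cv_errors = {k: cv_error(X_cens, y, k, 5, np.random.default_rng(1))
             for k in (1, 3, 5, 7)}
best_k = min(cv_errors, key=cv_errors.get)
print("cross-validated error by k:", cv_errors, "selected k:", best_k)
```

A fuller comparison in the spirit of the study would repeat this over many simulated datasets, add outliers and non-normal errors, and apply the same cross-validated tuning to each of the seven classifiers before ranking them.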

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f8c3/5488001/22f10afd513e/metabolites-07-00030-g001.jpg
