Performance of variable selection methods using stability-based selection.

Authors

Lu Danny, Weljie Aalim, de Leon Alexander R, McConnell Yarrow, Bathe Oliver F, Kopciuk Karen

Affiliations

Sick Kids Research Institute, 555 University Avenue, Toronto, ON, M5G 1X8, Canada.

Department of Systems Pharmacology and Translational Therapeutics, Perelman School of Medicine, University of Pennsylvania, 10-113 Translational Research Center, 3400 Civic Center Blvd, Bldg 421, Philadelphia, PA, 19104, USA.

Publication information

BMC Res Notes. 2017 Apr 4;10(1):143. doi: 10.1186/s13104-017-2461-8.

DOI:10.1186/s13104-017-2461-8
PMID:28376847
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC5379604/
Abstract

BACKGROUND

Variable selection is a routine step in the analysis of many types of high-dimensional data, including metabolomics data. This study compared the predictive performance of four variable selection methods used with stability-based selection, a new secondary selection method implemented in the R package BioMark. Two of these methods were also evaluated using the better-known false discovery rate (FDR).
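The general idea behind stability-based selection can be sketched in a few lines of Python. This is an illustrative sketch only, not the BioMark implementation: the function names, the Welch t statistic used as the ranking score, and the parameters `top_k`, `n_rounds`, `frac` and `threshold` are all assumptions made for the example. The data are repeatedly subsampled, variables are ranked on each subsample, and only variables that land among the top-ranked ones in a large share of the rounds are kept.

```python
import random
import statistics


def welch_t(a, b):
    """Welch's two-sample t statistic for two lists of floats."""
    se = (statistics.variance(a) / len(a) + statistics.variance(b) / len(b)) ** 0.5
    return (statistics.mean(a) - statistics.mean(b)) / se if se > 0 else 0.0


def stability_selection(group_a, group_b, top_k, n_rounds=100, frac=0.7,
                        threshold=0.8, seed=0):
    """Keep variables that rank in the top_k by |t| in >= threshold of the rounds.

    group_a, group_b: lists of samples; each sample is a list of variable values.
    All parameter names and defaults are hypothetical choices for this sketch.
    Returns (selected variable indices, selection frequency per variable).
    """
    rng = random.Random(seed)
    n_vars = len(group_a[0])
    counts = [0] * n_vars
    size_a = max(2, int(frac * len(group_a)))
    size_b = max(2, int(frac * len(group_b)))
    for _ in range(n_rounds):
        # Perturb the data by drawing a subsample (without replacement) per group.
        sub_a = rng.sample(group_a, size_a)
        sub_b = rng.sample(group_b, size_b)
        scores = [abs(welch_t([s[j] for s in sub_a], [s[j] for s in sub_b]))
                  for j in range(n_vars)]
        # Record which variables are among the top_k on this subsample.
        top = sorted(range(n_vars), key=lambda j: scores[j], reverse=True)[:top_k]
        for j in top:
            counts[j] += 1
    freqs = [c / n_rounds for c in counts]
    selected = [j for j, f in enumerate(freqs) if f >= threshold]
    return selected, freqs


# Toy data (simulated here, not from the study): 20 variables, of which the
# first two carry a large group difference.
rng = random.Random(1)
n_vars, n_per_group = 20, 20
controls = [[rng.gauss(0, 1) for _ in range(n_vars)] for _ in range(n_per_group)]
cases = [[rng.gauss(8 if j < 2 else 0, 1) for j in range(n_vars)]
         for _ in range(n_per_group)]

selected, freqs = stability_selection(cases, controls, top_k=2)
print("selected variables:", selected)
```

The ranking score is interchangeable. Substituting VIP scores or regression coefficients for the t statistic leaves the subsample-and-count logic unchanged.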

RESULTS

Simulation studies varied factors relevant to biological data studies, with results based on the median of 200 values of the partial area under the receiver operating characteristic curve (pAUC). No single method performed best across all factor settings, but Student's t test, either based on stability selection or with FDR adjustment, and the variable importance in projection (VIP) scores from partial least squares regression models obtained using a stability-based approach tended to perform well in most settings. Similar results were found with a real spiked-in metabolomics dataset. Group sample size, group effect size, number of significant variables and correlation structure were the most important factors, whereas the percentage of significant variables was the least important.
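For readers unfamiliar with the partial area under the ROC curve, the following Python sketch shows one common way to compute it: build the empirical ROC curve, then integrate it with the trapezoidal rule only up to a chosen false positive rate. The cutoff `max_fpr=0.1` and the function names are assumptions for illustration. The abstract does not state which false positive rate range the authors used.

```python
def roc_points(scores, labels):
    """Empirical ROC points (fpr, tpr); labels use 1 = positive, 0 = negative.

    Assumes both classes are present. Tied scores are processed together.
    """
    pairs = sorted(zip(scores, labels), key=lambda p: -p[0])
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(pairs):
        current = pairs[i][0]
        while i < len(pairs) and pairs[i][0] == current:
            if pairs[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def partial_auc(scores, labels, max_fpr=0.1):
    """Trapezoidal area under the ROC curve for fpr in [0, max_fpr]."""
    points = roc_points(scores, labels)
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 >= max_fpr:
            break
        if x1 > max_fpr:
            # Cut the last segment at max_fpr by linear interpolation.
            y1 = y0 + (y1 - y0) * (max_fpr - x0) / (x1 - x0)
            x1 = max_fpr
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Perfectly separated toy scores: the pAUC equals its maximum, max_fpr.
pauc_perfect = partial_auc([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0], max_fpr=0.1)
# The same scores with the labels reversed give a pAUC of zero.
pauc_reversed = partial_auc([0.9, 0.8, 0.2, 0.1], [0, 0, 1, 1], max_fpr=0.1)
print(pauc_perfect, pauc_reversed)
```

In a simulation study of this kind, a function like `partial_auc` would be evaluated once per simulated dataset and the resulting values summarized, for example with `statistics.median`.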

CONCLUSIONS

Researchers can improve prediction scores for their study data by choosing VIP scores based on stability selection over the other approaches when the number of variables is small to modest, and by increasing the number of samples even moderately. When the number of variables is high and there is block correlation amongst the significant variables (i.e., true biomarkers), the FDR-adjusted Student's t test performed best. The R package BioMark is an easy-to-use open-source program for variable selection that had excellent performance characteristics for the purposes of this study.
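As a rough illustration of what FDR adjustment of per-variable test results involves, the sketch below applies the Benjamini-Hochberg step-up procedure to a list of p-values. The abstract does not name the FDR procedure used, so Benjamini-Hochberg is an assumption here, and the p-values are made-up numbers rather than results from the study.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values, in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical p-values from four per-variable t tests.
p_values = [0.01, 0.04, 0.03, 0.20]
adjusted = benjamini_hochberg(p_values)

# Variables whose adjusted p-value is at or below the chosen FDR level.
fdr_level = 0.05
selected = [i for i, q in enumerate(adjusted) if q <= fdr_level]
print(adjusted, selected)
```

Note that the second and third variables have raw p-values below 0.05 but are not selected after adjustment, which is the kind of difference that separates an FDR-adjusted analysis from an unadjusted one.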



