Feature Selection Methods for Early Predictive Biomarker Discovery Using Untargeted Metabolomic Data.

Affiliations

INRA, UMR1019, UNH-MAPPING Clermont-Ferrand, France.

INRA, UMR1019, Plateforme d'Exploration du Métabolisme Clermont-Ferrand, France.

Publication information

Front Mol Biosci. 2016 Jul 8;3:30. doi: 10.3389/fmolb.2016.00030. eCollection 2016.

DOI:10.3389/fmolb.2016.00030
PMID:27458587
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC4937038/
Abstract

Untargeted metabolomics is a powerful phenotyping tool for better understanding the biological mechanisms involved in human pathology development and for identifying early predictive biomarkers. This approach, based on multiple analytical platforms, such as mass spectrometry (MS), chemometrics and bioinformatics, generates massive and complex data that need appropriate analyses to extract the biologically meaningful information. Despite the various tools available, it is still a challenge to handle such large and noisy datasets with a limited number of individuals without risking overfitting. Moreover, when the objective is to identify early predictive markers of clinical outcome a few years before occurrence, it becomes essential to use appropriate algorithms and workflows to discover subtle effects in this large amount of data. In this context, this work studies a workflow describing the general feature selection process, using knowledge discovery and data mining methodologies to propose advanced solutions for predictive biomarker discovery. The strategy focused on evaluating a combination of numeric-symbolic approaches for feature selection, with the objective of obtaining the best combination of metabolites producing an effective and accurate predictive model. Relying first on numerical approaches, especially machine learning methods (SVM-RFE, RF, RF-RFE) and univariate statistical analyses (ANOVA), a comparative study was performed on an original metabolomic dataset and reduced subsets. As the resampling method, leave-one-out cross-validation (LOOCV) was applied to minimize the risk of overfitting. The best k features obtained with different importance scores from the combination of these approaches were compared, and the stability of the variables was determined using Formal Concept Analysis. The results showed the value of combining RF-Gini with ANOVA for feature selection, as these two complementary methods selected the 48 best candidates for prediction. Using linear logistic regression on this reduced dataset gave the best performance in terms of prediction accuracy and number of false positives, with a model including the 5 top variables. These results therefore highlight the value of feature selection methods and the importance of working on reduced datasets for identifying predictive biomarkers from untargeted metabolomics data.
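The abstract combines univariate ranking (ANOVA) with LOOCV to limit overfitting when there are many features and few individuals. The sketch below illustrates only that idea and is not the authors' pipeline: it ranks features by the one-way ANOVA F statistic, repeats the selection inside every leave-one-out fold so the held-out sample never influences which features are chosen, and counts how often each feature is selected as a crude stability indicator. The data are synthetic, a nearest-centroid classifier stands in for the logistic regression reported in the paper, and every function name and parameter here is hypothetical.

```python
import random
from statistics import mean


def anova_f(values, labels):
    """One-way ANOVA F statistic for a single feature across classes."""
    groups = {}
    for v, lab in zip(values, labels):
        groups.setdefault(lab, []).append(v)
    grand = mean(values)
    # Between-group and within-group sums of squares.
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups.values())
    ss_within = sum((v - mean(g)) ** 2 for g in groups.values() for v in g)
    df_between = len(groups) - 1
    df_within = len(values) - len(groups)
    if ss_within == 0:
        return float("inf") if ss_between > 0 else 0.0
    return (ss_between / df_between) / (ss_within / df_within)


def top_k_features(X, y, k):
    """Indices of the k features with the largest F statistic."""
    n_features = len(X[0])
    scores = [anova_f([row[j] for row in X], y) for j in range(n_features)]
    return sorted(range(n_features), key=lambda j: scores[j], reverse=True)[:k]


def nearest_centroid_predict(X_train, y_train, x, features):
    """Predict the class whose centroid, on the selected features, is closest.

    Assumption: this simple classifier replaces the paper's logistic regression.
    """
    best_label, best_dist = None, float("inf")
    for label in sorted(set(y_train)):
        rows = [r for r, lab in zip(X_train, y_train) if lab == label]
        dist = sum((x[j] - mean(r[j] for r in rows)) ** 2 for j in features)
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


def loocv(X, y, k):
    """Leave-one-out CV with feature selection repeated inside every fold."""
    correct = 0
    counts = [0] * len(X[0])  # how many folds selected each feature
    for i in range(len(X)):
        X_train = X[:i] + X[i + 1:]
        y_train = y[:i] + y[i + 1:]
        features = top_k_features(X_train, y_train, k)
        for j in features:
            counts[j] += 1
        if nearest_centroid_predict(X_train, y_train, X[i], features) == y[i]:
            correct += 1
    return correct / len(X), counts


# Synthetic data (assumption): 30 samples, 40 features, 3 of them informative.
rng = random.Random(0)
n_per_class, n_features, informative = 15, 40, [0, 1, 2]
X, y = [], []
for label in (0, 1):
    for _ in range(n_per_class):
        row = [rng.gauss(0.0, 1.0) for _ in range(n_features)]
        for j in informative:
            row[j] += 3.0 * label  # shift the informative features in class 1
        X.append(row)
        y.append(label)

accuracy, counts = loocv(X, y, k=3)
print(f"LOOCV accuracy: {accuracy:.2f}")
print("Features selected in every fold:",
      [j for j, c in enumerate(counts) if c == len(X)])
```

Selecting features once on the full dataset and then cross-validating only the classifier would let the held-out sample leak into the selection step, which tends to inflate accuracy on small cohorts; nesting the selection inside each fold avoids that.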

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/963f/4937038/c73b1f1b37a8/fmolb-03-00030-g0001.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/963f/4937038/705541cc978e/fmolb-03-00030-g0002.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/963f/4937038/996088378226/fmolb-03-00030-g0003.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/963f/4937038/2d9eb7ca950d/fmolb-03-00030-g0004.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/963f/4937038/8c9d34bcd563/fmolb-03-00030-g0005.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/963f/4937038/ecd81f04f1dd/fmolb-03-00030-g0006.jpg

Similar articles

1
Feature Selection Methods for Early Predictive Biomarker Discovery Using Untargeted Metabolomic Data.
Front Mol Biosci. 2016 Jul 8;3:30. doi: 10.3389/fmolb.2016.00030. eCollection 2016.
2
Translational Metabolomics of Head Injury: Exploring Dysfunctional Cerebral Metabolism with Ex Vivo NMR Spectroscopy-Based Metabolite Quantification
3
A comparative investigation of modern feature selection and classification approaches for the analysis of mass spectrometry data.
Anal Chim Acta. 2014 Jun 4;829:1-8. doi: 10.1016/j.aca.2014.03.039. Epub 2014 Mar 31.
4
A random forest based biomarker discovery and power analysis framework for diagnostics research.
BMC Med Genomics. 2020 Nov 23;13(1):178. doi: 10.1186/s12920-020-00826-6.
5
Robustness of chemometrics-based feature selection methods in early cancer detection and biomarker discovery.
Stat Appl Genet Mol Biol. 2013 Mar 13;12(2):207-23. doi: 10.1515/sagmb-2012-0067.
6
SVM-RFE: selection and visualization of the most relevant features through non-linear kernels.
BMC Bioinformatics. 2018 Nov 19;19(1):432. doi: 10.1186/s12859-018-2451-4.
7
A Feature Selection Approach Guided an Early Prediction of Anthocyanin Accumulation Using Massive Untargeted Metabolomics Data in Mulberry.
Plant Cell Physiol. 2022 May 16;63(5):671-682. doi: 10.1093/pcp/pcac010.
8
An evolving computational platform for biological mass spectrometry: workflows, statistics and data mining with MASSyPup64.
PeerJ. 2015 Nov 17;3:e1401. doi: 10.7717/peerj.1401. eCollection 2015.
9
A Conversation on Data Mining Strategies in LC-MS Untargeted Metabolomics: Pre-Processing and Pre-Treatment Steps.
Metabolites. 2016 Nov 3;6(4):40. doi: 10.3390/metabo6040040.
10
Variable importance analysis based on rank aggregation with applications in metabolomics for biomarker discovery.
Anal Chim Acta. 2016 Mar 10;911:27-34. doi: 10.1016/j.aca.2015.12.043. Epub 2016 Jan 7.

Cited by

1
Can we really predict the respiratory morbidity of preterm birth?
Pediatr Res. 2025 Mar 18. doi: 10.1038/s41390-025-04012-1.
2
Untargeted Volatile Profiling Identifies Key Compounds Driving the Attraction of Western Flower Thrips to Cultivars.
Insects. 2025 Feb 16;16(2):216. doi: 10.3390/insects16020216.
3
Large-scale prospective serum metabolomic profiling reveals candidate predictive biomarkers for suspected preeclampsia patients.
Sci Rep. 2025 Feb 9;15(1):4807. doi: 10.1038/s41598-025-87905-9.
4
Characterization of fine-flavor cocoa in parent-hybrid combinations using metabolomics approach.
Food Chem X. 2024 Sep 12;24:101832. doi: 10.1016/j.fochx.2024.101832. eCollection 2024 Dec 30.
5
Identification of novel mitophagy-related biomarkers for Kawasaki disease by integrated bioinformatics and machine-learning algorithms.
Transl Pediatr. 2024 Aug 31;13(8):1439-1456. doi: 10.21037/tp-24-230. Epub 2024 Aug 26.
6
Artificial Intelligence in Metabolomics: A Current Review.
Trends Analyt Chem. 2024 Sep;178. doi: 10.1016/j.trac.2024.117852. Epub 2024 Jul 3.
7
Hybrid mRMR and multi-objective particle swarm feature selection methods and application to metabolomics of traditional Chinese medicine.
PeerJ Comput Sci. 2024 May 31;10:e2073. doi: 10.7717/peerj-cs.2073. eCollection 2024.
8
Circulatory histidine levels as predictive indicators of disease activity in takayasu arteritis.
Anal Sci Adv. 2021 Mar 20;2(11-12):527-535. doi: 10.1002/ansa.202000181. eCollection 2021 Dec.
9
Identification of CXCL16 as a diagnostic biomarker for obesity and intervertebral disc degeneration based on machine learning.
Sci Rep. 2023 Dec 3;13(1):21316. doi: 10.1038/s41598-023-48580-w.
10
DiffN Selection of Tandem Mass Spectrometry Precursors.
Anal Chem. 2023 Jun 27;95(25):9581-9588. doi: 10.1021/acs.analchem.3c01085. Epub 2023 Jun 13.

References

1
A tutorial review: Metabolomics and partial least squares-discriminant analysis--a marriage of convenience or a shotgun wedding.
Anal Chim Acta. 2015 Jun 16;879:10-23. doi: 10.1016/j.aca.2015.02.012. Epub 2015 Feb 11.
2
Workflow4Metabolomics: a collaborative research infrastructure for computational metabolomics.
Bioinformatics. 2015 May 1;31(9):1493-5. doi: 10.1093/bioinformatics/btu813. Epub 2014 Dec 19.
3
Cohort Profile Update: The GAZEL Cohort Study.
Int J Epidemiol. 2015 Feb;44(1):77-77g. doi: 10.1093/ije/dyu224. Epub 2014 Nov 23.
4
Statistical analysis and modeling of mass spectrometry-based metabolomics data.
Methods Mol Biol. 2014;1198:333-53. doi: 10.1007/978-1-4939-1258-2_22.
5
A comparative investigation of modern feature selection and classification approaches for the analysis of mass spectrometry data.
Anal Chim Acta. 2014 Jun 4;829:1-8. doi: 10.1016/j.aca.2014.03.039. Epub 2014 Mar 31.
6
Risk assessment tools for detecting those with pre-diabetes: a systematic review.
Diabetes Res Clin Pract. 2014 Jul;105(1):1-13. doi: 10.1016/j.diabres.2014.03.007. Epub 2014 Mar 18.
7
Merits of random forests emerge in evaluation of chemometric classifiers by external validation.
Anal Chim Acta. 2013 Nov 1;801:22-33. doi: 10.1016/j.aca.2013.09.027. Epub 2013 Sep 23.
8
Human metabolomics: strategies to understand biology.
Curr Opin Chem Biol. 2013 Oct;17(5):841-6. doi: 10.1016/j.cbpa.2013.06.015. Epub 2013 Jul 9.
9
Random forest in clinical metabolomics for phenotypic discrimination and biomarker selection.
Evid Based Complement Alternat Med. 2013;2013:298183. doi: 10.1155/2013/298183. Epub 2013 Feb 2.
10
Translational biomarker discovery in clinical metabolomics: an introductory tutorial.
Metabolomics. 2013 Apr;9(2):280-299. doi: 10.1007/s11306-012-0482-9. Epub 2012 Dec 4.