How (Not) to Generate a Highly Predictive Biomarker Panel Using Machine Learning.

Affiliation

Department of Chemistry, University of Kansas, Lawrence, Kansas 66045, United States.

Publication information

J Proteome Res. 2022 Sep 2;21(9):2071-2074. doi: 10.1021/acs.jproteome.2c00117. Epub 2022 Aug 25.

DOI:10.1021/acs.jproteome.2c00117

PMID:36004690

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC9680826/

Abstract

This review "teaches" researchers how to make their lackluster proteomics data look really impressive, by applying an inappropriate but pervasive strategy that selects features in a biased manner. The strategy is demonstrated and used to build a classification model with an accuracy of 92% and AUC of 0.98, while relying completely on random numbers for the data set. This "lesson" in data processing is not to be practiced by anyone; on the contrary, it is meant to be a cautionary tale showing that very unreliable results are obtained when a biomarker panel is generated first, using all the available data, and then tested by cross-validation. Data scientists describe the error committed in this scenario as having test data leak into the feature selection step, and it is currently a common mistake in proteomics biomarker studies that rely on machine learning. After the demonstration, advice is provided about how machine learning methods can be applied to proteomics data sets without generating artificially inflated accuracies.
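The leak described in the abstract can be reproduced in a few lines. The following is a minimal sketch under stated assumptions, not the article's own procedure: it uses only the Python standard library, a pure-noise data set, a simple class-mean-difference ranking for feature selection, and a nearest-centroid classifier. All function names, the sample and feature counts, and the panel size are hypothetical choices made for illustration.

```python
import random

def make_random_data(n_samples=40, n_features=2000, seed=0):
    """Pure-noise table: no feature carries any real class signal."""
    rng = random.Random(seed)
    X = [[rng.gauss(0.0, 1.0) for _ in range(n_features)] for _ in range(n_samples)]
    y = [i % 2 for i in range(n_samples)]  # alternating, balanced class labels
    return X, y

def class_means(X, y, rows, feats):
    """Per-class mean of each listed feature, using only the given rows."""
    means = {}
    for label in (0, 1):
        members = [i for i in rows if y[i] == label]
        means[label] = [sum(X[i][j] for i in members) / len(members) for j in feats]
    return means

def select_features(X, y, rows, k):
    """Keep the k features with the largest class-mean gap within `rows`."""
    all_feats = range(len(X[0]))
    means = class_means(X, y, rows, all_feats)
    gaps = [abs(a - b) for a, b in zip(means[0], means[1])]
    return sorted(all_feats, key=lambda j: gaps[j], reverse=True)[:k]

def predict(X, y, train_rows, test_row, feats):
    """Nearest-centroid prediction for one held-out sample."""
    means = class_means(X, y, train_rows, feats)

    def sq_dist(label):
        return sum((X[test_row][j] - m) ** 2 for j, m in zip(feats, means[label]))

    return min((0, 1), key=sq_dist)

def cv_accuracy(X, y, k=10, n_folds=5, leak=False):
    """k-fold CV accuracy, with or without the feature-selection leak."""
    n = len(X)
    all_rows = list(range(n))
    # The mistake: the panel is chosen once, from every sample, before CV.
    leaked_panel = select_features(X, y, all_rows, k) if leak else None
    correct = 0
    for f in range(n_folds):
        test_rows = set(range(f, n, n_folds))
        train_rows = [i for i in all_rows if i not in test_rows]
        # The fix: re-select the panel inside each fold, from training rows only.
        feats = leaked_panel if leak else select_features(X, y, train_rows, k)
        correct += sum(predict(X, y, train_rows, i, feats) == y[i] for i in test_rows)
    return correct / n

X, y = make_random_data()
leaky_acc = cv_accuracy(X, y, leak=True)
honest_acc = cv_accuracy(X, y, leak=False)
print(f"feature selection on all data, then CV: {leaky_acc:.2f}")
print(f"feature selection inside each CV fold:  {honest_acc:.2f}")
```

Because the data are random, any accuracy well above chance in the first line comes entirely from the test samples having influenced which features were kept. Re-selecting the panel inside each fold should bring the estimate back toward 0.5. The exact numbers depend on the seed and on the sizes chosen here.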
