How (Not) to Generate a Highly Predictive Biomarker Panel Using Machine Learning.

Affiliation

Department of Chemistry, University of Kansas, Lawrence, Kansas 66045, United States.

Publication information

J Proteome Res. 2022 Sep 2;21(9):2071-2074. doi: 10.1021/acs.jproteome.2c00117. Epub 2022 Aug 25.

Abstract

This review "teaches" researchers how to make their lackluster proteomics data look really impressive, by applying an inappropriate but pervasive strategy that selects features in a biased manner. The strategy is demonstrated and used to build a classification model with an accuracy of 92% and AUC of 0.98, while relying completely on random numbers for the data set. This "lesson" in data processing is not to be practiced by anyone; on the contrary, it is meant to be a cautionary tale showing that very unreliable results are obtained when a biomarker panel is generated first, using all the available data, and then tested by cross-validation. Data scientists describe the error committed in this scenario as having test data leak into the feature selection step, and it is currently a common mistake in proteomics biomarker studies that rely on machine learning. After the demonstration, advice is provided about how machine learning methods can be applied to proteomics data sets without generating artificially inflated accuracies.
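The leak described in the abstract can be illustrated with a minimal sketch. It is not the authors' procedure: the sample size, feature count, mean-difference feature ranking, and nearest-centroid classifier below are all assumptions chosen for illustration, and the function names are hypothetical. The data are pure random numbers, so any accuracy well above 50% is an artifact. The only difference between the two runs is whether features are selected once on all samples (the leak) or re-selected inside each training fold.

```python
import random

def make_random_data(n_samples=40, n_features=5000, seed=0):
    """Pure-noise data: features are N(0, 1) and labels carry no signal."""
    rng = random.Random(seed)
    X = [[rng.gauss(0.0, 1.0) for _ in range(n_features)] for _ in range(n_samples)]
    y = [i % 2 for i in range(n_samples)]  # balanced, arbitrary labels
    return X, y

def class_means(X, y, rows, feature):
    """Mean of one feature within each class, restricted to the given rows."""
    sums, counts = [0.0, 0.0], [0, 0]
    for i in rows:
        sums[y[i]] += X[i][feature]
        counts[y[i]] += 1
    return sums[0] / counts[0], sums[1] / counts[1]

def select_features(X, y, rows, k):
    """Rank features by absolute difference of class means; keep the top k."""
    def score(f):
        m0, m1 = class_means(X, y, rows, f)
        return abs(m1 - m0)
    return sorted(range(len(X[0])), key=score, reverse=True)[:k]

def predict(X, y, train_rows, features, sample):
    """Nearest-centroid prediction using only the chosen features."""
    dist = [0.0, 0.0]
    for f in features:
        m0, m1 = class_means(X, y, train_rows, f)
        dist[0] += (sample[f] - m0) ** 2
        dist[1] += (sample[f] - m1) ** 2
    return 0 if dist[0] <= dist[1] else 1

def cross_validate(X, y, k_features=20, n_folds=5, leak=False):
    """k-fold accuracy; leak=True selects features once using ALL samples."""
    all_rows = list(range(len(X)))
    leaked = select_features(X, y, all_rows, k_features) if leak else None
    correct = 0
    for fold in range(n_folds):
        test_rows = [i for i in all_rows if i % n_folds == fold]
        train_rows = [i for i in all_rows if i % n_folds != fold]
        # Without the leak, the held-out fold never influences selection.
        features = leaked if leak else select_features(X, y, train_rows, k_features)
        for i in test_rows:
            correct += predict(X, y, train_rows, features, X[i]) == y[i]
    return correct / len(X)

X, y = make_random_data()
biased = cross_validate(X, y, leak=True)
unbiased = cross_validate(X, y, leak=False)
print(f"selection on all data (leaky): {biased:.2f}")
print(f"selection inside each fold:    {unbiased:.2f}")
```

Under these assumptions the leaky run should report an accuracy far above chance, while the run that repeats selection inside each fold should stay near 50%. The exact figures depend on the seed and sizes and are not expected to match the 92% accuracy and 0.98 AUC reported in the paper.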

