Department of Chemistry, University of Kansas, Lawrence, Kansas 66045, United States.
J Proteome Res. 2022 Sep 2;21(9):2071-2074. doi: 10.1021/acs.jproteome.2c00117. Epub 2022 Aug 25.
This review "teaches" researchers how to make their lackluster proteomics data look really impressive, by applying an inappropriate but pervasive strategy that selects features in a biased manner. The strategy is demonstrated and used to build a classification model with an accuracy of 92% and AUC of 0.98, while relying completely on random numbers for the data set. This "lesson" in data processing is not to be practiced by anyone; on the contrary, it is meant to be a cautionary tale showing that very unreliable results are obtained when a biomarker panel is generated first, using all the available data, and then tested by cross-validation. Data scientists describe the error committed in this scenario as having test data leak into the feature selection step, and it is currently a common mistake in proteomics biomarker studies that rely on machine learning. After the demonstration, advice is provided about how machine learning methods can be applied to proteomics data sets without generating artificially inflated accuracies.
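The leak described above can be reproduced in a few lines. The following is a minimal sketch, not the authors' procedure: the data set size (40 samples, 2000 features), the selection rule (largest absolute difference in class means), the nearest-centroid classifier, and every function name are assumptions chosen for illustration. It runs the same cross-validation twice on pure random numbers, once with features selected from all samples before splitting (the biased strategy) and once with features re-selected inside each training fold.

```python
import random


def make_random_data(n_samples=40, n_features=2000, seed=0):
    """Pure noise: Gaussian features with arbitrary, balanced class labels."""
    rng = random.Random(seed)
    X = [[rng.gauss(0.0, 1.0) for _ in range(n_features)] for _ in range(n_samples)]
    y = [i % 2 for i in range(n_samples)]
    return X, y


def select_features(X, y, rows, k):
    """Return the k features with the largest absolute class-mean difference,
    computed only from the samples listed in `rows`."""
    pos = [i for i in rows if y[i] == 1]
    neg = [i for i in rows if y[i] == 0]
    scores = []
    for j in range(len(X[0])):
        diff = sum(X[i][j] for i in pos) / len(pos) - sum(X[i][j] for i in neg) / len(neg)
        scores.append((abs(diff), j))
    scores.sort(reverse=True)
    return [j for _, j in scores[:k]]


def centroid(X, rows, feats):
    """Mean of each selected feature over the given samples."""
    return [sum(X[i][j] for i in rows) / len(rows) for j in feats]


def predict(x, feats, c0, c1):
    """Nearest-centroid rule using squared Euclidean distance."""
    d0 = sum((x[j] - c) ** 2 for j, c in zip(feats, c0))
    d1 = sum((x[j] - c) ** 2 for j, c in zip(feats, c1))
    return 1 if d1 < d0 else 0


def cross_validate(X, y, k_features=10, n_folds=5, leak=False):
    """Return cross-validated accuracy.

    leak=True  : features are chosen once from ALL samples (test data leaks
                 into feature selection).
    leak=False : features are re-chosen inside every fold from training
                 samples only.
    """
    rows = list(range(len(X)))
    leaked = select_features(X, y, rows, k_features) if leak else None
    correct = 0
    for fold in range(n_folds):
        test = [i for i in rows if i % n_folds == fold]
        train = [i for i in rows if i % n_folds != fold]
        feats = leaked if leak else select_features(X, y, train, k_features)
        c0 = centroid(X, [i for i in train if y[i] == 0], feats)
        c1 = centroid(X, [i for i in train if y[i] == 1], feats)
        correct += sum(predict(X[i], feats, c0, c1) == y[i] for i in test)
    return correct / len(X)


X, y = make_random_data()
biased = cross_validate(X, y, leak=True)
unbiased = cross_validate(X, y, leak=False)
print(f"selection on all data, then CV : accuracy = {biased:.2f}")
print(f"selection inside each fold     : accuracy = {unbiased:.2f}")
```

Because the labels carry no information, only the second number is an honest estimate, and it should sit near chance. The first is inflated solely because the held-out samples helped choose the features. In a library workflow, the same protection is usually obtained by wrapping feature selection and the classifier in a single pipeline object that is refit on each training fold.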