
Overoptimism in cross-validation when using partial least squares-discriminant analysis for omics data: a systematic study.

Affiliations

Signal and Information Processing for Sensing Systems, Institute for Bioengineering of Catalonia, The Barcelona Institute for Science and Technology, Baldiri Reixac 4-8, 08028, Barcelona, Spain.

Department of Electronics and Biomedical Engineering, University of Barcelona, Martí i Franqués 1, 08028, Barcelona, Spain.

Publication information

Anal Bioanal Chem. 2018 Sep;410(23):5981-5992. doi: 10.1007/s00216-018-1217-1. Epub 2018 Jun 29.

DOI: 10.1007/s00216-018-1217-1
PMID: 29959482

Abstract

Advances in analytical instrumentation have made it possible to examine thousands of genes, peptides, or metabolites in parallel. However, the costly and time-consuming data acquisition process causes a general lack of samples. From a data analysis perspective, omics data are characterized by high dimensionality and small sample counts. In many scenarios, the analytical aim is to differentiate between two conditions or classes by combining an analytical method with a tailored qualitative predictive model built from the available examples collected in a dataset. For this purpose, partial least squares-discriminant analysis (PLS-DA) is frequently employed in omics research. Recently, there has been growing concern about the uncritical use of this method, since it is prone to overfitting and may aggravate problems of false discoveries. In many applications involving a small number of subjects or samples, predictive model performance estimation is based only on cross-validation (CV) results, with a strong preference for reporting results using leave-one-out (LOO). The combination of PLS-DA for high-dimensional data and small sample conditions, together with a weak validation methodology, is a recipe for unreliable estimates of model performance. In this work, we present a systematic study of the impact of dataset size, dimensionality, and the CV technique used on PLS-DA overoptimism when performance is estimated by cross-validation. First, by using synthetic data generated from the same probability distribution and assigned random binary labels, we obtained a dataset where the true classification rate (CR) is 50%. As expected, our results confirm that internal validation provides overoptimistic estimates of the classification accuracy (i.e., overfitting). We have characterized the CR estimator in terms of bias and variance as a function of the internal CV technique used and the sample-to-dimensionality ratio. In small sample conditions, due to the large bias and variance of the estimator, extremely good CRs occur commonly. We have found that overfitting peaks when the sample size in the training subset approaches the feature vector dimensionality minus one. In these conditions, the models are neither under- nor overdetermined and have a unique solution. This effect is particularly intense for LOO and peaks higher in small sample conditions. Overoptimism decreases beyond this point, where the abundance of noise produces a regularization effect leading to less complex models. In terms of overfitting, our study ranks CV methods as follows: bootstrap produces the most accurate estimator of the CR, followed by bootstrapped Latin partitions, random subsampling, and K-fold; the very popular LOO provides the worst results. Simulation results are further confirmed on real datasets from mass spectrometry and microarrays.
