Fitting and Cross-Validating Cox Models to Censored Big Data With Missing Values Using Extensions of Partial Least Squares Regression Models.

Author information

Bertrand Frédéric, Maumy-Bertrand Myriam

Affiliations

LIST3N, Université de Technologie de Troyes, Troyes, France.

IRMA, CNRS UMR 7501, Labex IRMIA, Université de Strasbourg, Strasbourg, France.

Publication information

Front Big Data. 2021 Nov 1;4:684794. doi: 10.3389/fdata.2021.684794. eCollection 2021.

DOI:10.3389/fdata.2021.684794

PMID:34790895

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC8591675/

Abstract

Fitting Cox models in a big data context (on a massive scale in terms of volume, intensity, and complexity, exceeding the capacity of usual analytic tools) is often challenging. If some data are missing, it is even more difficult. We proposed algorithms that fit Cox models in high-dimensional settings using extensions of partial least squares (PLS) regression to the Cox model. Some of them can cope with missing data. We were recently able to extend our most recent algorithms to big data, making it possible to fit Cox models to big data with missing values. When cross-validating standard or extended Cox models, the commonly used criterion is the cross-validated partial log-likelihood, computed using either a naive scheme or the van Houwelingen scheme, which makes efficient use of the death times of the left-out data in relation to the death times of all the data. Quite astonishingly, we show, using a strong simulation study involving three different data simulation algorithms, that these two cross-validation methods fail with the extensions, whether straightforward or more involved, of partial least squares regression to the Cox model. This result is interesting for at least two reasons. First, several attractive features of PLS-based models, including regularization, interpretability of the components, missing data support, and data visualization through biplots of individuals and variables (and even parsimony or group parsimony for sparse partial least squares (SPLS) or sparse group SPLS based models), account for the common use of these extensions by statisticians, who usually select their hyperparameters using cross-validation. Second, they are almost always featured in benchmarking studies that assess the performance of a new estimation technique in a high-dimensional or big data context, and they often show poor statistical properties. We carried out a vast simulation study to evaluate more than a dozen potential cross-validation criteria, either AUC-based or prediction-error-based. Several of them lead to the selection of a reasonable number of components. Using these newly found cross-validation criteria to fit extensions of partial least squares regression to the Cox model, we performed a benchmark reanalysis that showed enhanced performance of these techniques. In addition, we proposed sparse group extensions of our algorithms and defined a new robust measure based on the Schmid score and the R coefficient of determination for least absolute deviation: the integrated R Schmid Score weighted. The R package used in this article is available on CRAN at http://cran.r-project.org/web/packages/plsRcox/index.html. The R package bigPLS will soon be available on CRAN and, until then, is available on GitHub at https://github.com/fbertran/bigPLS.
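
To make the two partial log-likelihood criteria mentioned in the abstract concrete, here is a minimal pure-Python sketch. It is an illustration under stated assumptions, not the authors' implementation: it uses a single covariate, a Breslow-type partial log-likelihood, and a crude grid search as a stand-in for a real Cox fitter. The function names (`partial_loglik`, `fit_beta`, `cv_partial_loglik`) and the toy data are hypothetical. The naive scheme evaluates the partial log-likelihood on the left-out fold alone. The van Houwelingen scheme, in its usual formulation, takes the difference between the full-data and training-data partial log-likelihoods at the training-fold estimate, so that left-out death times are assessed against the risk sets of all the data.

```python
import math

def partial_loglik(beta, times, events, x):
    """Cox partial log-likelihood (Breslow form) for a single covariate."""
    eta = [beta * xi for xi in x]
    ll = 0.0
    for i, (t_i, d_i) in enumerate(zip(times, events)):
        if d_i:  # only observed events contribute a term
            # Risk set: every subject still under observation at time t_i.
            risk = sum(math.exp(eta[j]) for j, t_j in enumerate(times) if t_j >= t_i)
            ll += eta[i] - math.log(risk)
    return ll

def fit_beta(times, events, x, lo=-5.0, hi=5.0, steps=401):
    """Crude grid-search maximiser; an assumed stand-in for a real Cox fitter."""
    grid = [lo + (hi - lo) * k / (steps - 1) for k in range(steps)]
    return max(grid, key=lambda b: partial_loglik(b, times, events, x))

def cv_partial_loglik(times, events, x, n_folds, scheme="vvh"):
    """Cross-validated partial log-likelihood, summed over folds.

    scheme="naive": partial log-likelihood of the left-out fold only.
    scheme="vvh":   full-data minus training-data partial log-likelihood.
    """
    n = len(times)

    def pick(idx, values):
        return [values[i] for i in idx]

    total = 0.0
    for k in range(n_folds):
        test = [i for i in range(n) if i % n_folds == k]
        train = [i for i in range(n) if i % n_folds != k]
        train_data = (pick(train, times), pick(train, events), pick(train, x))
        beta_k = fit_beta(*train_data)  # estimate without fold k
        if scheme == "naive":
            total += partial_loglik(beta_k, pick(test, times),
                                    pick(test, events), pick(test, x))
        else:
            total += (partial_loglik(beta_k, times, events, x)
                      - partial_loglik(beta_k, *train_data))
    return total

# Hypothetical toy data: follow-up times, event indicators (1 = death), covariate.
times = [2.0, 3.5, 1.2, 6.1, 4.4, 0.9, 5.3, 2.8, 7.0, 3.1, 1.7, 4.9]
events = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0]
x = [0.8, -0.2, 1.1, -1.3, 0.1, 1.6, -0.7, 0.4, -1.5, 0.3, 0.9, -0.4]

cv_naive = cv_partial_loglik(times, events, x, n_folds=3, scheme="naive")
cv_vvh = cv_partial_loglik(times, events, x, n_folds=3, scheme="vvh")
```

In a component-selection setting, either quantity would be computed for each candidate number of components and the maximiser retained. The abstract's finding is that this selection rule behaves poorly for the PLS-based extensions, which is why alternative criteria are investigated.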
