Combined Performance of Screening and Variable Selection Methods in Ultra-High Dimensional Data in Predicting Time-To-Event Outcomes.

Authors

Pi Lira, Halabi Susan

Affiliation

Department of Biostatistics and Bioinformatics, Duke University Medical Center, Durham, NC 27710.

Publication information

Diagn Progn Res. 2018;2. doi: 10.1186/s41512-018-0043-4. Epub 2018 Sep 26.

DOI:10.1186/s41512-018-0043-4
PMID:30393771
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC6214199/

Abstract

BACKGROUND

Building prognostic models of clinical outcomes is an increasingly important research task and will remain a vital area in genomic medicine. Prognostic models of clinical outcomes are usually built and validated utilizing variable selection methods and machine learning tools. The challenges, however, in ultra-high dimensional space are not only to reduce the dimensionality of the data, but also to retain the important variables which predict the outcome. Screening approaches, such as the sure independence screening (SIS), iterative SIS (ISIS) and principled SIS (PSIS) have been developed to overcome the challenge of high dimensionality. We are interested in identifying important single-nucleotide polymorphisms (SNPs) and integrating them into a validated prognostic model of overall survival in patients with metastatic prostate cancer. While the abovementioned variable selection approaches have theoretical justification in selecting SNPs, the comparison and the performance of these combined methods in predicting time-to-event outcomes have not been previously studied in ultra-high dimensional space with hundreds of thousands of variables.
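The screening idea behind SIS can be illustrated with a minimal sketch: score each variable on its own against the censored outcome, then keep only the top-ranked ones for a later selection step. The version below ranks variables by the absolute marginal Cox score statistic at zero effect, which is a simplified stand-in and not necessarily the marginal utility used in the paper. The function name `marginal_cox_score_screen`, the `keep` parameter and the simulated data are assumptions made for illustration only.

```python
import numpy as np


def marginal_cox_score_screen(X, time, event, keep):
    """Rank covariates by the absolute marginal Cox score statistic at beta = 0.

    Assumes no tied event times; ties would need an explicit tie correction.
    Returns (indices of the `keep` top-ranked columns, |z| for every column).
    """
    X = np.asarray(X, dtype=float)
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=bool)

    # Sort by decreasing time so that rows 0..k form the risk set of row k.
    order = np.argsort(-time, kind="stable")
    Xs, ev = X[order], event[order][:, None]

    n_at_risk = np.arange(1, len(time) + 1)[:, None]
    risk_mean = np.cumsum(Xs, axis=0) / n_at_risk
    risk_var = np.cumsum(Xs ** 2, axis=0) / n_at_risk - risk_mean ** 2

    # Score and information, summed over observed events only.
    score = np.where(ev, Xs - risk_mean, 0.0).sum(axis=0)
    info = np.where(ev, risk_var, 0.0).sum(axis=0)

    z = np.abs(score) / np.sqrt(np.maximum(info, 1e-12))
    top = np.argsort(-z)[:keep]
    return top, z


# Hypothetical simulated data: only column 0 affects the hazard.
rng = np.random.default_rng(0)
n, p = 300, 2000
X = rng.standard_normal((n, p))
event_time = rng.exponential(scale=np.exp(-1.5 * X[:, 0]))
censor_time = rng.exponential(scale=2.0, size=n)
time = np.minimum(event_time, censor_time)
event = event_time <= censor_time

top, z = marginal_cox_score_screen(X, time, event, keep=20)
print(top)
```

Screening of this kind only reduces the number of candidates; a penalized model such as LASSO or ALASSO would then be fitted on the retained columns. An iterative variant (ISIS) repeats the screening conditional on variables already selected, which this sketch does not attempt.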

METHODS

We conducted a series of simulations to compare the performance of different combinations of variable selection approaches and tree-based methods, such as the least absolute shrinkage and selection operator (LASSO), the adaptive LASSO (ALASSO) and random survival forest (RSF), in ultra-high dimensional data, with the aim of developing prognostic models for a time-to-event outcome that is subject to censoring. The variable selection methods were evaluated for discrimination (Harrell's concordance statistic), calibration and overall performance. In addition, we applied these approaches to 498,081 SNPs from 623 Caucasian patients with prostate cancer.
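As an illustration of the discrimination measure named above, the following is a minimal sketch of Harrell's concordance statistic. A pair of patients is comparable when the one with the shorter observed time had an event; the pair is concordant when that patient also has the higher predicted risk. The function name `harrell_c` is hypothetical, tied observed times are simply skipped, and tied risk scores count as half concordant, which is one common convention rather than the paper's documented choice.

```python
def harrell_c(time, event, risk):
    """Harrell's C for right-censored data (O(n^2) reference implementation).

    time  : observed follow-up times
    event : 1/True if the event was observed, 0/False if censored
    risk  : predicted risk scores, higher meaning shorter expected survival
    """
    concordant = 0.0
    comparable = 0
    n = len(time)
    for i in range(n):
        if not event[i]:
            continue  # a censored subject cannot be the earlier member of a pair
        for j in range(n):
            if time[i] < time[j]:
                comparable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0
                elif risk[i] == risk[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy data: four subjects, the third one censored.
time = [1.0, 2.0, 3.0, 4.0]
event = [1, 1, 0, 1]
print(harrell_c(time, event, [4, 3, 2, 1]))  # risk perfectly ordered
print(harrell_c(time, event, [1, 2, 3, 4]))  # risk perfectly reversed
```

A value of 0.5 corresponds to predictions no better than chance and 1.0 to perfect ranking, which is the scale on which the c-index values reported in this abstract should be read.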

RESULTS

When n=300, ISIS-LASSO and ISIS-ALASSO chose all the informative variables, which resulted in the highest Harrell's c-index (>0.80). On the other hand, with a small sample size (n=150), ALASSO performed better than any other combination, as demonstrated by the highest c-index and/or overall performance, although there was evidence of overfitting. In analyzing the prostate cancer data, the ISIS-ALASSO, SIS-LASSO, and SIS-ALASSO combinations achieved the highest discrimination, with a c-index of 0.67.

CONCLUSIONS

Choosing the appropriate variable selection method for training a model is a critical step in developing a robust prognostic model. Based on the simulation studies, the effective use of ALASSO or a combination of methods, such as ISIS-LASSO and ISIS-ALASSO, allows for the development of prognostic models with both high predictive accuracy and a low risk of overfitting, assuming moderate sample sizes.

Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e298/6460519/b2897343f372/41512_2018_43_Fig1_HTML.jpg
Figure 2: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e298/6460519/eeb906835a86/41512_2018_43_Fig2_HTML.jpg
Figure 3: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e298/6460519/4fdfe206bc2b/41512_2018_43_Fig3_HTML.jpg