Single-center versus multi-center data sets for molecular prognostic modeling: a simulation study.

Affiliations

Helmholtz Zentrum, München, Ingolstädter Landstr. 1, Neuherberg, 85764, Germany.

Department of Medical Information Processing, Biometry and Epidemiology, University of Munich, Marchioninistr. 15, Munich, 81377, Germany.

Publication information

Radiat Oncol. 2020 May 14;15(1):109. doi: 10.1186/s13014-020-01543-1.

DOI:10.1186/s13014-020-01543-1

PMID:32410693

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC7227093/

Abstract

BACKGROUND

Prognostic models based on high-dimensional omics data generated from clinical patient samples, such as tumor tissues or biopsies, are increasingly used for prognosis of radio-therapeutic success. The model development process requires two independent discovery and validation data sets. Each of them may contain samples collected in a single center or a collection of samples from multiple centers. Multi-center data tend to be more heterogeneous than single-center data but are less affected by potential site-specific biases. Optimal use of limited data resources for discovery and validation with respect to the expected success of a study requires dispassionate, objective decision-making. In this work, we addressed the impact of the choice of single-center and multi-center data as discovery and validation data sets, and assessed how this impact depends on the three data characteristics signal strength, number of informative features and sample size.

METHODS

We set up a simulation study to quantify the predictive performance of a model trained and validated on different combinations of in silico single-center and multi-center data. The standard bioinformatical analysis workflow of batch correction, feature selection and parameter estimation was emulated. For the determination of model quality, four measures were used: false discovery rate, prediction error, chance of successful validation (significant correlation of predicted and true validation data outcome) and model calibration.
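The workflow described above can be illustrated with a minimal, standard-library-only sketch. It is not the authors' simulation code: the data-generating model (independent Gaussian features with a purely additive center-specific shift), the batch correction stand-in (per-center mean-centering), the univariate correlation filter, the per-feature least-squares slopes, and all names and default values (`run_once`, `signal`, `k`, `shift_sd`, and so on) are assumptions chosen for brevity. Model calibration, the fourth measure, is omitted, and the significance of the correlation is approximated with a Fisher z-transformation.

```python
import math
import random
from statistics import NormalDist, fmean


def simulate_center(rng, n, p, betas, shift_sd):
    """Draw n samples from one hypothetical center with its own additive feature shift."""
    shift = [rng.gauss(0.0, shift_sd) for _ in range(p)]  # site-specific bias (assumed additive)
    X, y = [], []
    for _ in range(n):
        clean = [rng.gauss(0.0, 1.0) for _ in range(p)]
        y.append(sum(b * c for b, c in zip(betas, clean)) + rng.gauss(0.0, 1.0))
        X.append([c + s for c, s in zip(clean, shift)])
    return X, y


def center_columns(X):
    """Batch correction stand-in: subtract the per-center mean of every feature."""
    p = len(X[0])
    means = [fmean(row[j] for row in X) for j in range(p)]
    return [[row[j] - means[j] for j in range(p)] for row in X]


def pearson(a, b):
    """Pearson correlation; returns 0.0 if either input is constant."""
    ma, mb = fmean(a), fmean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb) if va > 0 and vb > 0 else 0.0


def make_dataset(rng, n_total, n_centers, p, betas, shift_sd=1.0):
    """Pool equally sized centers after correcting each one separately."""
    X, y = [], []
    for _ in range(n_centers):
        Xc, yc = simulate_center(rng, n_total // n_centers, p, betas, shift_sd)
        X.extend(center_columns(Xc))
        y.extend(yc)
    return X, y


def run_once(seed=0, n=60, p=200, n_informative=5, signal=0.8, k=10,
             centers_disc=3, centers_val=3):
    """One simulation run: discovery, then validation on independent data."""
    rng = random.Random(seed)
    betas = [signal] * n_informative + [0.0] * (p - n_informative)
    Xd, yd = make_dataset(rng, n, centers_disc, p, betas)
    Xv, yv = make_dataset(rng, n, centers_val, p, betas)

    # Feature selection: keep the k features most correlated with the outcome.
    cols = [[row[j] for row in Xd] for j in range(p)]
    scores = [abs(pearson(col, yd)) for col in cols]
    selected = sorted(range(p), key=lambda j: scores[j], reverse=True)[:k]

    # Parameter estimation: one univariate least-squares slope per selected
    # feature (a simplification that relies on the features being independent).
    y_mean = fmean(yd)
    slopes = {
        j: sum(x * (t - y_mean) for x, t in zip(cols[j], yd)) / sum(x * x for x in cols[j])
        for j in selected
    }

    pred = [y_mean + sum(slopes[j] * row[j] for j in selected) for row in Xv]
    fdr = sum(1 for j in selected if j >= n_informative) / k   # false discovery rate
    mse = fmean((a - b) ** 2 for a, b in zip(pred, yv))        # prediction error
    r = pearson(pred, yv)
    # One-sided p-value from the Fisher z-transformation (normal approximation).
    r_safe = max(min(r, 0.999999), -0.999999)
    p_value = 1.0 - NormalDist().cdf(math.atanh(r_safe) * math.sqrt(len(yv) - 3))
    return {"fdr": fdr, "mse": mse, "r": r, "p_value": p_value}


if __name__ == "__main__":
    print("multi-center discovery: ", run_once(seed=1, centers_disc=3))
    print("single-center discovery:", run_once(seed=1, centers_disc=1))
```

Repeating `run_once` over many seeds and averaging each measure would give the kind of comparison the study reports. Because the shift simulated here is fully removed by mean-centering, this sketch shows only the structure of such an experiment and should not be expected to reproduce the published results.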

RESULTS

In agreement with literature about generalizability of signatures, prognostic models fitted to multi-center data consistently outperformed their single-center counterparts when the prediction error was the quality criterion of interest. However, for low signal strengths and small sample sizes, single-center discovery sets showed superior performance with respect to false discovery rate and chance of successful validation.

CONCLUSIONS

With regard to decision making, this simulation study underlines the importance of study aims being defined precisely a priori. Minimization of the prediction error requires multi-center discovery data, whereas single-center data are preferable with respect to false discovery rate and chance of successful validation when the expected signal or sample size is low. In contrast, the choice of validation data solely affects the quality of the estimator of the prediction error, which was more precise on multi-center validation data.

Figure 1 (image): https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0356/7227093/916d8f3bce5a/13014_2020_1543_Fig1_HTML.jpg
