Institute for Medical Information Processing, Biometry and Epidemiology, University of Munich, Marchioninistr. 15, Munich, 81377, Germany.
Chair of Statistics, School of Business and Economics, Humboldt-Universität zu Berlin, Spandauer Straße 1, Berlin, 10178, Germany.
BMC Med Res Methodol. 2019 Jul 24;19(1):162. doi: 10.1186/s12874-019-0802-0.
Omics data can be very informative in survival analysis and may improve the prognostic ability of classical models based on clinical risk factors for various diseases, for example breast cancer. Recent research has focused on integrating omics and clinical data, yet has often neglected the need for appropriate model building for clinical variables. Medical literature on classical prognostic scores, as well as biostatistical literature on appropriate model selection strategies for low-dimensional (clinical) data, is often ignored in the context of omics research. The goal of this paper is to fill this methodological gap by investigating the added predictive value of gene expression data for models using varying amounts of clinical information.
We analyze two data sets from the field of survival prognosis of breast cancer patients. First, we construct several proportional hazards prediction models using varying amounts of clinical information based on established medical knowledge. These models are then used as a starting point (i.e. included as a clinical offset) for identifying informative gene expression variables using resampling procedures and penalized regression approaches (model-based boosting and the LASSO). To assess the added predictive value of the gene signatures, measures of prediction accuracy and separation are examined on a validation data set, both for the clinical models and for the models that combine the two sources of information.
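The idea of a clinical offset can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the clinical linear predictor has already been fitted and is held fixed, uses the Breslow partial likelihood, and fits the L1-penalized gene coefficients by plain proximal gradient descent with a fixed step size. All function names, the simulated data, and the tuning values (`lam`, `step`, `n_iter`) are hypothetical choices made for illustration only.

```python
import numpy as np

def cox_offset_loss_grad(beta, X, time, event, offset):
    """Averaged negative Breslow partial log-likelihood and its gradient.

    The linear predictor is offset + X @ beta, where `offset` is a fixed
    clinical score that is NOT re-estimated here (assumption of this sketch).
    """
    eta = offset + X @ beta
    shift = eta.max()                      # subtract the max for numerical stability
    w = np.exp(eta - shift)
    # at_risk[i, j] = 1 if subject j is still at risk at the time of subject i
    at_risk = (time[None, :] >= time[:, None]).astype(float)
    denom = at_risk @ w                    # risk-set sums of exp(eta), rescaled
    n_events = event.sum()
    loglik = np.sum(event * (eta - shift - np.log(denom)))
    # risk-set weighted mean of the covariates, one row per subject
    weighted_mean = (at_risk * w) @ X / denom[:, None]
    grad = -(event[:, None] * (X - weighted_mean)).sum(axis=0)
    return -loglik / n_events, grad / n_events

def soft_threshold(z, t):
    """Proximal operator of the L1 norm."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cox_with_offset(X, time, event, offset, lam, step=0.1, n_iter=500):
    """L1-penalized Cox fit for the columns of X, with a fixed offset.

    Fixed-step proximal gradient; the step size is an untuned assumption.
    """
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        _, grad = cox_offset_loss_grad(beta, X, time, event, offset)
        beta = soft_threshold(beta - step * grad, step * lam)
    return beta

# Toy simulated data, purely illustrative and unrelated to the paper's data sets.
rng = np.random.default_rng(0)
n, p = 200, 20
clinical = rng.normal(size=(n, 2))
genes = rng.normal(size=(n, p))
offset = clinical @ np.array([0.8, -0.5])  # stands in for a fitted clinical model
hazard = np.exp(offset + 1.0 * genes[:, 0])
true_time = rng.exponential(1.0 / hazard)
censor_time = rng.exponential(2.0, size=n)
event = (true_time <= censor_time).astype(float)
obs_time = np.minimum(true_time, censor_time)

beta_hat = lasso_cox_with_offset(genes, obs_time, event, offset, lam=0.1)
print("selected gene indices:", np.flatnonzero(beta_hat))
```

Because the offset enters the linear predictor unpenalized and unchanged, a gene is selected only if it explains survival beyond what the clinical score already captures, which is the sense in which a richer clinical model may leave fewer genes to select. In practice, the penalty would be chosen by resampling rather than fixed in advance.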
For one data set, we do not find any substantial added predictive value of the omics data when compared to clinical models. On the second data set, we identify a noticeable added predictive value, but only for scenarios where little or no clinical information is included in the modeling process. We find that including more clinical information can lead to a smaller number of selected omics predictors.
New research using omics data should include all available established medical knowledge in order to allow an adequate evaluation of the added predictive value of omics data. Including all relevant clinical information in the analysis might also lead to more parsimonious models. The developed procedure to assess the predictive value of the omics data can be readily applied to other scenarios.