Department of Statistical Science, Temple University.
Department of Biostatistics & Bioinformatics, Fox Chase Cancer Center, Temple University Health System, Philadelphia, PA, USA.
Bioinformatics. 2020 Jun 1;36(11):3409-3417. doi: 10.1093/bioinformatics/btaa161.
One of the major goals in large-scale genomic studies is to identify genes whose prognostic impact on time-to-event outcomes provides insight into the disease process. With rapid developments in high-throughput genomic technologies over the past two decades, researchers can now monitor the expression levels of tens of thousands of genes and proteins, resulting in enormous datasets in which the number of genomic features far exceeds the number of subjects. Methods based on univariate Cox regression are often used to select genomic features related to survival outcome; however, the Cox model assumes proportional hazards (PH), which is unlikely to hold for every feature. When applied to genomic features exhibiting some form of non-proportional hazards (NPH), these methods can under- or over-estimate the effects. We propose a broad array of marginal screening techniques that aid feature ranking and selection by accommodating various forms of NPH. First, we develop an approach based on Kullback-Leibler information divergence and the Yang-Prentice model that includes methods for the PH and proportional odds (PO) models as special cases. Next, we propose R² measures for the PH and PO models that can be interpreted in terms of explained randomness. Lastly, we propose a generalized pseudo-R² index that includes PH, PO, crossing hazards and crossing odds models as special cases and can be interpreted as the percentage of separability, according to feature measurements, between subjects who experience the event and those who do not.
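To make the baseline concrete, the following is a minimal sketch of marginal screening with a univariate Cox model, the conventional approach that the proposed measures are meant to improve on. It is not the authors' implementation (their R code is linked below). It ranks features by the Cox score-test statistic at β = 0, handles tied event times in the Breslow manner, and uses only NumPy. The function name `cox_score_statistics` and the simulated data are assumptions made for illustration.

```python
import numpy as np


def cox_score_statistics(time, event, X):
    """Per-feature score-test statistic at beta = 0 for a univariate Cox model.

    Hypothetical helper for illustration. time: (n,) follow-up times;
    event: (n,) 1/True if the event was observed, 0/False if censored;
    X: (n, p) feature matrix. Returns a (p,) array of statistics that are
    approximately chi-square(1) under no association and PH.
    """
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=bool)
    X = np.asarray(X, dtype=float)

    # Sort subjects by decreasing time so that cumulative sums run over risk sets.
    order = np.argsort(-time, kind="stable")
    t, d, Xs = time[order], event[order], X[order]

    # n_risk[i] = number of subjects with time >= t[i] (ties included).
    neg_t = -t  # ascending, as required by searchsorted
    n_risk = np.searchsorted(neg_t, neg_t, side="right")

    # Mean and variance of each feature within the risk set of every subject.
    csum = np.cumsum(Xs, axis=0)
    csum2 = np.cumsum(Xs ** 2, axis=0)
    mean = csum[n_risk - 1] / n_risk[:, None]
    var = np.clip(csum2[n_risk - 1] / n_risk[:, None] - mean ** 2, 0.0, None)

    # Score U and information I, summed over observed events only.
    U = (Xs - mean)[d].sum(axis=0)
    I = var[d].sum(axis=0)
    return U ** 2 / I


# Simulated example (assumed data): only feature 0 affects the hazard.
rng = np.random.default_rng(0)
n, p = 200, 5
X = rng.standard_normal((n, p))
T = rng.exponential(1.0 / np.exp(X[:, 0]))   # event times, hazard = exp(x0)
C = rng.exponential(2.0, size=n)             # independent censoring times
time, event = np.minimum(T, C), T <= C

stats = cox_score_statistics(time, event, X)
ranking = np.argsort(-stats)                 # most strongly associated first
print(ranking, np.round(stats, 2))
```

Because this statistic is derived under PH, it can rank a feature with crossing hazards poorly even when that feature clearly separates event from non-event subjects. That is the gap the NPH-aware measures described above are intended to close.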
We evaluate the performance of our measures using extensive simulation studies and publicly available datasets in cancer genomics. We demonstrate that the proposed methods successfully address the issue of NPH in genomic feature selection and outperform existing methods.
R code for the proposed methods is available at github.com/lburns27/Feature-Selection.
Supplementary data are available at Bioinformatics online.