Ohio State University.
Brief Bioinform. 2021 Jan 18;22(1):334-345. doi: 10.1093/bib/bbaa007.
Many high-throughput genomic applications involve a large set of potential covariates and a response that is frequently measured on an ordinal scale, so it is crucial to identify which variables are truly associated with the response. Effectively controlling the false discovery rate (FDR) without sacrificing power has been a major challenge in variable selection research. This study reviews two existing variable selection frameworks, model-X knockoffs and a modified version of reference distribution variable selection (RDVS), both of which use artificial variables as benchmarks for decision making. Model-X knockoffs constructs a 'knockoff' variable for each covariate that mimics the covariance structure of the original covariates, while RDVS generates only one null variable and forms a reference distribution by performing multiple runs of model fitting. Herein, we describe how different importance measures for ordinal responses can be constructed to fit into these two selection frameworks, using either penalized regression or machine learning techniques. We compared these measures in terms of FDR and power using simulated data. Moreover, we applied the two frameworks to high-throughput methylation data to identify features associated with the progression from normal liver tissue to hepatocellular carcinoma, and thereby to further compare their performance.
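The abstract does not spell out how knockoff variables act as benchmarks in the final selection step. As an illustration only, the sketch below assumes the standard knockoff+ rule of Barber and Candès, which may differ from the exact procedure used in the study. It takes one statistic W_j per covariate, assumed here to be the importance of the original covariate minus the importance of its knockoff, and finds the smallest threshold at which the estimated false discovery proportion falls to the target level q or below. The function names `knockoff_threshold` and `knockoff_select` and the example values in `w` are hypothetical.

```python
def knockoff_threshold(w, q=0.1, offset=1):
    """Return the smallest t > 0 whose estimated FDP is <= q.

    w      : list of statistics W_j (assumed: original importance
             minus knockoff importance; large positive = evidence).
    q      : target FDR level.
    offset : 1 gives the knockoff+ rule (assumed here), 0 the
             more liberal knockoff rule.
    Returns float('inf') when no threshold meets the target.
    """
    # Only the nonzero magnitudes of W need to be tried as thresholds.
    candidates = sorted({abs(x) for x in w if x != 0})
    for t in candidates:
        n_neg = sum(1 for x in w if x <= -t)  # knockoff beat the original
        n_pos = sum(1 for x in w if x >= t)   # original beat the knockoff
        fdp_hat = (offset + n_neg) / max(1, n_pos)
        if fdp_hat <= q:
            return t
    return float("inf")


def knockoff_select(w, q=0.1, offset=1):
    """Return indices of covariates whose W_j reaches the threshold."""
    t = knockoff_threshold(w, q, offset)
    return [j for j, x in enumerate(w) if x >= t]


# Made-up statistics for ten covariates (illustrative values only).
w = [5.0, 4.0, 3.5, 3.0, 2.5, 2.0, -0.5, 0.3, -0.2, 0.1]
selected = knockoff_select(w, q=0.2)
```

With these made-up values and q = 0.2, the six covariates with the largest statistics are selected. At q = 0.1 nothing is selected, because the offset of 1 in the numerator means at least 1/q = 10 candidate discoveries are needed before the estimate can reach the target.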