Department of Epidemiology and Biostatistics, University of Maryland, College Park, MD 20742, USA.
Department of Statistics, University of Pittsburgh, Pittsburgh, PA 15260, USA.
Bioinformatics. 2022 Sep 2;38(17):4078-4087. doi: 10.1093/bioinformatics/btac518.
Advances in high-throughput technology have characterized a wide variety of epigenetic modifications and noncoding RNAs across the genome that contribute to disease pathogenesis by regulating gene expression. The high dimensionality of both the epigenetic/noncoding RNA data and the gene expression data makes it challenging to identify the important regulators of genes. Conducting a univariate test for each possible regulator-gene pair carries a serious multiple-comparison burden, and directly applying regularization methods to select regulator-gene pairs is computationally infeasible. Applying fast screening to reduce the dimension before regularization is more efficient and stable than applying regularization methods alone.
We propose a novel screening method based on robust partial correlation to detect epigenetic and noncoding RNA regulators of gene expression over the whole genome, a problem that involves both high-dimensional predictors and high-dimensional responses. Compared with existing screening methods, our method is conceptually innovative in that it reduces the dimension of both the predictors and the responses, and screens at both the node (regulators or genes) and edge (regulator-gene pairs) levels. We develop data-driven procedures to determine the conditional sets and the optimal screening threshold, and implement a fast iterative algorithm. Simulations and applications to long noncoding RNA and microRNA regulation in kidney cancer and to DNA methylation regulation in glioblastoma multiforme illustrate the validity and advantages of our method.
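To make the idea of edge-level screening by partial correlation concrete, the following is a minimal sketch and not the rPCor algorithm itself. It assumes a rank-based (Spearman) correlation as a stand-in for the robust estimator, a single shared conditioning variable `z` instead of a data-driven conditional set, and a fixed hypothetical threshold of 0.3 instead of a data-driven one. All function names and the simulated data are illustrative only.

```python
import math
import random


def ranks(values):
    """Average ranks (1-based); tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2.0 + 1.0
        for k in range(start, end + 1):
            out[order[k]] = mean_rank
        start = end + 1
    return out


def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def spearman(a, b):
    """Rank-based correlation, used here as an assumed robust estimator."""
    return pearson(ranks(a), ranks(b))


def partial_corr(x, y, z):
    """First-order partial correlation of x and y given one variable z."""
    r_xy, r_xz, r_yz = spearman(x, y), spearman(x, z), spearman(y, z)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


def screen_edges(regulators, genes, z, threshold):
    """Keep (regulator, gene) pairs whose |partial correlation| exceeds threshold."""
    kept = []
    for reg_name, reg_values in regulators.items():
        for gene_name, gene_values in genes.items():
            if abs(partial_corr(reg_values, gene_values, z)) > threshold:
                kept.append((reg_name, gene_name))
    return kept


# Simulated toy data (hypothetical): z is a shared confounder, and only
# reg1 has a direct effect on gene1.
random.seed(0)
n = 500
z = [random.gauss(0, 1) for _ in range(n)]
reg1 = [zi + random.gauss(0, 1) for zi in z]
reg2 = [zi + random.gauss(0, 1) for zi in z]
gene1 = [2 * r + zi + random.gauss(0, 1) for r, zi in zip(reg1, z)]
gene2 = [zi + random.gauss(0, 1) for zi in z]

kept = screen_edges(
    {"reg1": reg1, "reg2": reg2},
    {"gene1": gene1, "gene2": gene2},
    z,
    threshold=0.3,  # hypothetical fixed cutoff, not a data-driven choice
)
print(kept)
```

In this toy setting every regulator is marginally correlated with every gene through `z`, so marginal screening would retain all four pairs; conditioning on `z` is what lets the sketch separate the direct reg1-gene1 edge from the confounded ones. Node-level screening and the iterative updating of conditional sets described in the abstract are not shown here.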
The R package, related source code and real datasets used in this article are available at https://github.com/kehongjie/rPCor.
Supplementary data are available at Bioinformatics online.