Department of Biostatistics, Mailman School of Public Health, Columbia University, New York, NY 10032, USA.
Stat Med. 2013 May 30;32(12):2127-39. doi: 10.1002/sim.5694. Epub 2012 Dec 5.
Matched case-control designs are commonly used to control for potential confounding factors in genetic epidemiology studies, especially epigenetic studies of DNA methylation. Compared with unmatched case-control studies with high-dimensional genomic or epigenetic data, few variable selection methods are available for matched sets. In an earlier paper, we proposed a penalized logistic regression model with a network-based penalty for the analysis of unmatched DNA methylation data. However, for the matched designs widely used in epigenetic studies, which compare DNA methylation between tumor and adjacent non-tumor tissues or between pre-treatment and post-treatment conditions, applying ordinary logistic regression while ignoring the matching is known to introduce serious bias in estimation. In this paper, we develop a penalized conditional logistic model for the analysis of matched DNA methylation data, using a network-based penalty that encourages a grouping effect of (1) linked Cytosine-phosphate-Guanine (CpG) sites within a gene or (2) linked genes within a genetic pathway. In our simulation studies, the conditional logistic model outperformed the unconditional logistic model in high-dimensional variable selection problems for matched case-control data. We further investigated the benefits of utilizing biological group or graph information for matched case-control data. We applied the proposed method to a genome-wide DNA methylation study of hepatocellular carcinoma (HCC), in which the DNA methylation levels of tumor and adjacent non-tumor tissues from HCC patients were measured using the Illumina Infinium HumanMethylation27 BeadChip. The method identified several new CpG sites and genes that are known to be related to HCC but were missed by the standard method used in the original paper.
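To make the idea of a network-penalized conditional logistic model concrete, the following is a minimal sketch and not the authors' implementation. It assumes 1:1 matched pairs (for example, tumor versus adjacent non-tumor tissue from the same patient), in which case the conditional likelihood depends only on the within-pair differences and no intercept is needed. The penalty form (an L1 term plus a quadratic term in an unnormalized graph Laplacian), the proximal-gradient solver, and every name here (`fit_paired_network_clogit`, `lam1`, `lam2`, `laplacian`) are illustrative assumptions that do not come from the abstract.

```python
import numpy as np


def fit_paired_network_clogit(x_case, x_ctrl, laplacian,
                              lam1=0.05, lam2=0.1, n_iter=500):
    """Illustrative fit for 1:1 matched pairs (hypothetical helper).

    Minimizes, by proximal gradient descent,
        mean_i log(1 + exp(-d_i . beta))
        + lam1 * ||beta||_1
        + (lam2 / 2) * beta' L beta,
    where d_i = x_case_i - x_ctrl_i and L is a graph Laplacian over the
    features (assumed here: L = degree matrix - adjacency matrix).
    """
    d = np.asarray(x_case, dtype=float) - np.asarray(x_ctrl, dtype=float)
    lap = np.asarray(laplacian, dtype=float)
    n, p = d.shape

    # Step size from an upper bound on the Lipschitz constant of the
    # gradient of the smooth part (likelihood + Laplacian term).
    lipschitz = (np.linalg.norm(d, 2) ** 2) / (4.0 * n) \
        + lam2 * np.linalg.norm(lap, 2)
    step = 1.0 / (lipschitz + 1e-12)

    beta = np.zeros(p)
    for _ in range(n_iter):
        z = np.clip(d @ beta, -30.0, 30.0)   # clip to avoid overflow in exp
        weights = 1.0 / (1.0 + np.exp(z))    # sigmoid(-z)
        grad = -(d.T @ weights) / n + lam2 * (lap @ beta)
        b = beta - step * grad
        # Soft-thresholding: proximal operator of the L1 penalty.
        beta = np.sign(b) * np.maximum(np.abs(b) - step * lam1, 0.0)
    return beta
```

In this sketch the L1 term performs variable selection, while the Laplacian term shrinks the coefficients of linked features (such as CpG sites within one gene) toward each other, which is one way to obtain the grouping effect described above. Designs with more than one control per case would need the general conditional likelihood rather than this paired shortcut.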