Wu Cen, Zhang Qingzhao, Jiang Yu, Ma Shuangge
Department of Statistics, Kansas State University, Manhattan, KS, 66506, USA.
School of Economics and the Wang Yanan Institute for Studies in Economics, Xiamen University.
J Multivar Anal. 2018 Nov;168:119-130. doi: 10.1016/j.jmva.2018.06.009. Epub 2018 Jul 10.
Because of its important biological implications, the association between gene expression (GE) and copy number variation (CNV) has been modeled extensively. Such analysis is challenging because of the high data dimensionality, the lack of knowledge about which CNVs regulate a specific GE, the different behaviors of cis-acting and trans-acting CNVs, possible long-tailed distributions and contamination of GE measurements, and correlations between CNVs. The existing methods fail to address one or more of these challenges. In this study, a new method is developed to model the GE-CNV associations more effectively. Specifically, for each GE, a partially linear model with a nonlinear cis-acting CNV effect is assumed. A robust loss function is adopted to accommodate long-tailed distributions and data contamination. We adopt penalization to accommodate the high dimensionality and to identify relevant CNVs. A network structure is introduced to accommodate the correlations among CNVs. The proposed method comprehensively accommodates multiple challenging characteristics of GE-CNV modeling and effectively overcomes the limitations of existing methods. We develop an effective computational algorithm and rigorously establish the consistency properties. Simulation shows the superiority of the proposed method over alternatives. The TCGA (The Cancer Genome Atlas) data on the PCD (programmed cell death) pathway are analyzed, and the proposed method yields improved prediction and stability as well as biologically plausible findings.
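The abstract names four ingredients: a partially linear model, a robust loss, penalized selection, and a network structure over correlated CNVs. As a rough illustration only, the Python sketch below evaluates one objective function of that general shape. Every specific choice here is an assumption made for illustration and is not taken from the paper: the least absolute deviation loss, the polynomial basis standing in for a spline basis, the minimax concave penalty (MCP), the thresholded-correlation adjacency, and all function and parameter names.

```python
import numpy as np

def poly_basis(z, degree=3):
    """Polynomial basis for the nonlinear term (assumed stand-in for splines)."""
    return np.column_stack([z ** d for d in range(degree + 1)])

def mcp(beta, lam, gamma=3.0):
    """Minimax concave penalty, summed over all coefficients."""
    a = np.abs(beta)
    return np.sum(np.where(a <= gamma * lam,
                           lam * a - a ** 2 / (2.0 * gamma),
                           0.5 * gamma * lam ** 2))

def adjacency(X, cutoff=0.5):
    """Correlation matrix with small entries and the diagonal set to zero."""
    R = np.corrcoef(X, rowvar=False)
    A = np.where(np.abs(R) > cutoff, R, 0.0)
    np.fill_diagonal(A, 0.0)
    return A

def network_penalty(beta, A):
    """Sum over pairs j<k of |a_jk| * (beta_j - sign(a_jk) * beta_k)^2."""
    diff = beta[:, None] - np.sign(A) * beta[None, :]
    # Each pair appears twice in the full sum, hence the factor 0.5.
    return 0.5 * np.sum(np.abs(A) * diff ** 2)

def objective(alpha, beta, y, z, X, lam1, lam2, A):
    """Robust loss + sparsity penalty + network penalty (illustrative only)."""
    resid = y - poly_basis(z, degree=len(alpha) - 1) @ alpha - X @ beta
    lad = np.mean(np.abs(resid))  # least absolute deviation loss
    return lad + mcp(beta, lam1) + 0.5 * lam2 * network_penalty(beta, A)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, p = 50, 5
    z = rng.uniform(-1.0, 1.0, size=n)   # hypothetical cis-acting CNV
    X = rng.normal(size=(n, p))          # hypothetical trans-acting CNVs
    beta_true = np.array([1.0, 0.0, 0.0, -0.5, 0.0])
    y = np.sin(2.0 * z) + X @ beta_true + rng.standard_t(df=2, size=n)
    A = adjacency(X, cutoff=0.2)
    print(objective(np.zeros(4), beta_true, y, z, X, 0.1, 0.1, A))
```

In this sketch the nonlinear term is unpenalized while the linear coefficients carry both the sparsity and the network penalty, which is one common way to combine the two; an actual fit would minimize this objective over `alpha` and `beta`, which the sketch does not attempt.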