School of Mathematical Sciences, University of Chinese Academy of Sciences, Beijing, P. R. China.
Key Laboratory of Big Data Mining and Knowledge Management, Chinese Academy of Sciences, Beijing, P. R. China.
Biom J. 2022 Mar;64(3):461-480. doi: 10.1002/bimj.202000157. Epub 2021 Nov 1.
In high-throughput cancer studies, gene-environment interactions associated with outcomes have important implications. Some commonly adopted identification methods do not respect the "main effect, interaction" hierarchical structure. In addition, they can be challenged by data contamination and/or long-tailed distributions, which are not uncommon. In this article, robust methods based on γ-divergence and density power divergence are proposed to accommodate contaminated data/long-tailed distributions. A hierarchical sparse group penalty is adopted for regularized estimation and selection and can identify important gene-environment interactions and respect the "main effect, interaction" hierarchical structure. The proposed methods are implemented using an effective group coordinate descent algorithm. Simulation shows that when contamination occurs, the proposed methods can significantly outperform the existing alternatives with more accurate identification. The proposed approach is applied to the analysis of The Cancer Genome Atlas (TCGA) triple-negative breast cancer data and Gene Environment Association Studies (GENEVA) Type 2 Diabetes data.
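To give a feel for why a density power divergence criterion tolerates contamination, the following is a minimal toy sketch, not the penalized regression method of the article. It assumes a normal model with known standard deviation and estimates only the mean. The function name `dpd_location`, the tuning value `alpha=0.5`, and the simulated data are all illustrative assumptions.

```python
import math
import random
import statistics


def dpd_location(xs, sigma=1.0, alpha=0.5, iters=100):
    """Toy minimum density-power-divergence estimate of a normal mean.

    Assumes sigma is known and fixed (an assumption of this sketch).
    Under that assumption the estimating equation is
    sum_i w_i * (x_i - mu) = 0 with w_i = exp(-alpha * (x_i - mu)^2 / (2 sigma^2)),
    which is solved here by fixed-point iteration.
    """
    mu = statistics.median(xs)  # robust starting value
    for _ in range(iters):
        # Weights are proportional to the model density raised to the power
        # alpha, so observations far from mu receive weights close to zero.
        w = [math.exp(-alpha * (x - mu) ** 2 / (2 * sigma ** 2)) for x in xs]
        mu = sum(wi * x for wi, x in zip(w, xs)) / sum(w)
    return mu


# Simulated data (hypothetical): 90% from N(0, 1), 10% contamination near 10.
random.seed(0)
clean = [random.gauss(0.0, 1.0) for _ in range(180)]
outliers = [random.gauss(10.0, 1.0) for _ in range(20)]
data = clean + outliers

mean_est = statistics.fmean(data)  # ordinary mean, pulled toward the outliers
dpd_est = dpd_location(data)       # downweights the contaminated points

print(f"sample mean: {mean_est:.3f}")
print(f"DPD estimate: {dpd_est:.3f}")
```

In this sketch the contaminated points receive essentially zero weight, so the estimate stays near the center of the clean data while the sample mean is shifted. As `alpha` approaches 0 the weights become equal and the estimate reduces to the ordinary mean, which is the usual trade-off between efficiency and robustness.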