Chai Hao, Shi Xingjie, Zhang Qingzhao, Zhao Qing, Huang Yuan, Ma Shuangge
Department of Biostatistics, Yale University, New Haven, Connecticut, United States of America.
Department of Statistics, Nanjing University of Finance and Economics, Nanjing Shi, Jiangsu Sheng, China.
Genet Epidemiol. 2017 Dec;41(8):779-789. doi: 10.1002/gepi.22066. Epub 2017 Sep 14.
Gene expression (GE) studies have played a critical role in cancer research. Despite tremendous effort, analysis results often remain unsatisfactory because of weak signals and high data dimensionality. Analysis is often further challenged by the long-tailed distributions of the outcome variables. In recent multidimensional studies, data have been collected on GEs as well as their regulators (e.g., copy number alterations (CNAs), methylation, and microRNAs), which can provide additional information on the associations between GEs and cancer outcomes. In this study, we develop an ARMI (assisted robust marker identification) approach for analyzing cancer studies with measurements on GEs as well as their regulators. The proposed approach borrows information from the regulators and can be more effective than analyzing GE data alone. A robust objective function is adopted to accommodate long-tailed distributions, and marker identification is realized using penalization. The proposed approach has an intuitive formulation and is computationally affordable. Simulation shows satisfactory performance under a variety of settings. Analysis of The Cancer Genome Atlas (TCGA) data on melanoma and lung cancer leads to biologically plausible marker identification and superior prediction.
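As a rough illustration of the two generic ingredients named in the abstract, a robust objective function and penalized marker selection, the following is a minimal sketch and not the ARMI method itself. It assumes a Huber loss and an L1 (lasso) penalty, solved by proximal gradient descent on simulated data with long-tailed (t-distributed) errors. The loss, the penalty, the tuning values, and all function names (`huber_grad`, `soft_threshold`, `robust_lasso`) are assumptions made for illustration, and the sketch does not include the step of borrowing information from regulators.

```python
import numpy as np


def huber_grad(r, delta):
    """Derivative of the Huber loss with respect to the residual r."""
    return np.clip(r, -delta, delta)


def soft_threshold(z, t):
    """Proximal operator of the L1 penalty (elementwise soft-thresholding)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)


def robust_lasso(X, y, lam, delta=1.0, n_iter=2000):
    """Minimize (1/n) * sum Huber(y - X beta) + lam * ||beta||_1.

    Hypothetical helper: plain proximal gradient descent with a fixed step.
    """
    n, p = X.shape
    # The Huber loss has curvature at most 1, so the gradient of the smooth
    # part is Lipschitz with constant at most sigma_max(X)^2 / n.
    step = n / (np.linalg.norm(X, 2) ** 2)
    beta = np.zeros(p)
    for _ in range(n_iter):
        r = y - X @ beta
        grad = -X.T @ huber_grad(r, delta) / n
        beta = soft_threshold(beta - step * grad, step * lam)
    return beta


# Simulated example (all values below are illustrative assumptions).
rng = np.random.default_rng(0)
n, p = 200, 50
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[:3] = [2.0, -2.0, 1.5]        # three true markers
noise = rng.standard_t(df=2, size=n)    # long-tailed errors
y = X @ beta_true + noise

beta_hat = robust_lasso(X, y, lam=0.2)
selected = np.flatnonzero(beta_hat)     # indices of identified markers
print("selected markers:", selected)
```

In this sketch the bounded Huber gradient limits the influence of extreme outcomes, while soft-thresholding sets small coefficients exactly to zero, which is what performs the selection. The penalty level `lam` would normally be chosen by cross-validation rather than fixed in advance.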