Stingo Francesco C, Swartz Michael D, Vannucci Marina
Department of Biostatistics, MD Anderson Cancer Center, 1400 Pressler St. Houston, TX 77030, USA.
Department of Biostatistics, UT School of Public Health, 1200 Pressler St. Houston, TX 77030, USA.
Stat Interface. 2015;8(2):137-151. doi: 10.4310/SII.2015.v8.n2.a2.
Complex diseases, such as cancer, arise from complex etiologies involving multiple single-nucleotide polymorphisms (SNPs), each contributing a small amount to the overall risk of disease. Thus, many researchers have gone beyond single-SNP analysis methods, focusing instead on groups of SNPs, for example by analyzing haplotypes. More recently, pathway-based methods have been proposed that use prior biological knowledge of gene function to achieve a more powerful analysis of genome-wide association study (GWAS) data. In this paper we propose a novel Bayesian modeling framework to identify molecular biomarkers for disease prediction. Our method combines pathway-based approaches with multiple-SNP analyses of a specified region of interest. The model's development is motivated by SNP data from a lung cancer study. In our approach we define gene-level scores based on SNP allele frequencies and use a linear modeling setting to study the scores' association with the observed phenotype. The basic idea behind the definition of gene-level scores is to weight the SNPs within a gene according to their rarity, based on the genotype frequencies expected under the Hardy-Weinberg equilibrium law. This results in scores that give more importance to unusually low frequencies, i.e. to SNPs that might indicate peculiar genetic differences between subjects belonging to different groups. An additional feature of our approach is that we incorporate information on SNP-to-SNP associations into the model. In particular, we use network priors that model the linkage disequilibrium between SNPs. For posterior inference, we design a stochastic search method that identifies significant biomarkers (genes and SNPs) for disease prediction. We assess performance on simulated data and compare the results with those of existing approaches. We then show the ability of the proposed methodology to detect relevant genes and associated SNPs in a lung cancer dataset.
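The abstract does not give the formula for the gene-level scores, so the following is only a minimal sketch of the general idea of weighting SNPs by rarity under Hardy-Weinberg equilibrium. It assumes genotypes are coded as minor-allele counts (0, 1, 2) and, purely for illustration, scores each subject by summing the negative log of the expected frequency of each observed genotype within the gene. The function names `hwe_genotype_freqs` and `gene_scores` and the negative-log weighting are hypothetical choices, not the authors' definition.

```python
import numpy as np

def hwe_genotype_freqs(genotypes):
    """Expected genotype frequencies under Hardy-Weinberg equilibrium.

    genotypes: integer array of shape (n_subjects, n_snps) holding
    minor-allele counts in {0, 1, 2}.
    Returns an array of shape (3, n_snps); row k is the expected
    frequency of carrying k copies of the counted allele.
    """
    q = genotypes.mean(axis=0) / 2.0   # estimated frequency of the counted allele
    p = 1.0 - q
    return np.vstack([p**2, 2.0 * p * q, q**2])

def gene_scores(genotypes, eps=1e-12):
    """Illustrative gene-level score per subject (an assumption, not the
    paper's formula): sum over the gene's SNPs of -log(expected HWE
    frequency of the observed genotype), so rarer genotypes weigh more."""
    freqs = hwe_genotype_freqs(genotypes)
    snp_idx = np.arange(genotypes.shape[1])
    # Advanced indexing picks freqs[g_ij, j] for every subject i and SNP j.
    observed = freqs[genotypes, snp_idx]
    return -np.log(observed + eps).sum(axis=1)

# Toy data: 4 subjects, 2 SNPs in one hypothetical gene.
genotypes = np.array([[0, 0],
                      [0, 1],
                      [2, 0],
                      [0, 1]])
scores = gene_scores(genotypes)
# The subject at index 2 carries the rare homozygous genotype at the first
# SNP and therefore receives the largest score.
```

In a setting like the one described, scores of this kind (one per gene and subject) would then enter a linear model as predictors of the phenotype; the network prior on SNP selection and the stochastic search are not part of this sketch.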