Department of Statistics, University of California at Irvine, CA, USA.
Stat Med. 2013 May 30;32(12):2114-26. doi: 10.1002/sim.5680. Epub 2012 Nov 22.
High-throughput scientific studies involving no clear a priori hypothesis are common. For example, a large-scale genomic study of a disease may examine thousands of genes without hypothesizing that any specific gene is responsible for the disease. In these studies, the objective is to explore a large number of possible factors (e.g., genes) in order to identify a small number that will be considered in follow-up studies that tend to be more thorough and on smaller scales. A simple, hierarchical, linear regression model with random coefficients is assumed for case-control data that correspond to each gene. The specific model used will be seen to be related to a standard Bayesian variable selection model. Relatively large regression coefficients correspond to potential differences in responses for cases versus controls and thus to genes that might 'matter'. For large-scale studies, and using a Dirichlet process mixture model for the regression coefficients, we are able to find clusters of regression effects of genes with increasing potential effect or 'relevance', in relation to the outcome of interest. One cluster will always correspond to genes whose coefficients are in a neighborhood that is relatively close to zero and will be deemed least relevant. Other clusters will correspond to increasing magnitudes of the random/latent regression coefficients. Using simulated data, we demonstrate that our approach could be quite effective in finding relevant genes compared with several alternative methods. We apply our model to two large-scale studies. The first study involves transcriptome analysis of infection by human cytomegalovirus. The second study's objective is to identify differentially expressed genes between two types of leukemia.
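The generative structure described in the abstract can be illustrated with a small simulation. The following is a minimal sketch under stated assumptions, not the authors' model or code: the abstract gives no equations, so the Chinese restaurant process representation of the Dirichlet process, the normal base distribution, the noise level, the choice to pin the first cluster's effect at exactly zero, and all function names are hypothetical and chosen only for illustration. It draws cluster labels for genes, assigns each cluster a shared regression coefficient, simulates case-control data per gene, and recovers each coefficient by least squares.

```python
import random
import statistics


def crp_assignments(n_genes, concentration, rng):
    """Draw cluster labels for genes from a Chinese restaurant process."""
    labels, counts = [], []
    for _ in range(n_genes):
        # An existing cluster k has weight counts[k]; a new cluster has
        # weight `concentration`.
        weights = counts + [concentration]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(counts):
            counts.append(1)
        else:
            counts[k] += 1
        labels.append(k)
    return labels


def simulate_gene(beta, n_cases, n_controls, noise_sd, rng):
    """Simulate y = beta * x + noise, with x = 1 for cases and 0 for controls."""
    x = [1] * n_cases + [0] * n_controls
    y = [beta * xi + rng.gauss(0.0, noise_sd) for xi in x]
    return x, y


def ols_slope(x, y):
    """Least-squares slope of y on x.

    For a binary x this equals the case mean minus the control mean.
    """
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    return sxy / sxx


rng = random.Random(0)
n_genes = 200
labels = crp_assignments(n_genes, concentration=1.0, rng=rng)

# Assumption: cluster 0 is pinned at zero to play the role of the
# "least relevant" cluster; other clusters draw a shared effect from N(0, 2^2).
cluster_effect = {
    k: 0.0 if k == 0 else rng.gauss(0.0, 2.0) for k in sorted(set(labels))
}
true_beta = [cluster_effect[k] for k in labels]

# Estimate each gene's coefficient from 20 simulated cases and 20 controls.
estimates = []
for beta in true_beta:
    x, y = simulate_gene(beta, n_cases=20, n_controls=20, noise_sd=1.0, rng=rng)
    estimates.append(ols_slope(x, y))

# Rank genes by the magnitude of the estimated coefficient.
ranked = sorted(range(n_genes), key=lambda j: abs(estimates[j]), reverse=True)
```

This sketch only simulates from a clustered-coefficient model and estimates each gene separately. Fitting a Dirichlet process mixture to the coefficients, as the abstract describes, would additionally require posterior inference over the cluster assignments, which is omitted here.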