Pattaro Cristian, Ruczinski Ingo, Fallin Danièle M, Parmigiani Giovanni
Unit of Genetic Epidemiology and Biostatistics, Institute of Genetic Medicine, European Academy, Viale Druso 1, I-39100, Bolzano, Italy.
BMC Genomics. 2008 Aug 29;9:405. doi: 10.1186/1471-2164-9-405.
Identification of disease-related genes in association studies is challenged by the large number of SNPs typed. To address the dilution of power caused by high dimensionality, and to generate results that are biologically interpretable, it is critical to take into consideration spatial correlation of SNPs along the genome. With the goal of identifying true genetic associations, partitioning the genome according to spatial correlation can be a powerful and meaningful way to address this dimensionality problem.
We developed and validated an MCMC Algorithm To Identify blocks of Linkage DisEquilibrium (MATILDE) for clustering contiguous SNPs, and a statistical testing framework to detect association using partitions as units of analysis. We compared its ability to detect true SNP associations to that of the most commonly used algorithm for block partitioning, as implemented in the Haploview and HapBlock software. Simulations were based on artificially assigning phenotypes to individuals with SNPs corresponding to region 14q11 of the HapMap database. When block partitioning is performed using MATILDE, the ability to correctly identify a disease SNP is higher than with the alternatives considered, especially for small effects. The advantages can be seen both in the number of true positive findings and in limiting the number of false discoveries. Finer partitions provided by LD-based methods or by marker-by-marker analysis are efficient only for detecting large effects, or in the presence of large sample sizes. The probabilistic approach we propose offers several additional advantages, including: a) adapting the estimation of blocks to the population, technology, and sample size of the study; b) probabilistic assessment of uncertainty about block boundaries and about whether any two SNPs are in the same block; c) user selection of the probability threshold for assigning SNPs to the same block.
We demonstrate that, in realistic scenarios, our adaptive, study-specific block partitioning approach is as efficient as, or more efficient than, currently available LD-based approaches in guiding the search for disease loci.
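As an illustration of points (b) and (c) above, the following minimal Python sketch shows one way posterior samples of block partitions could be summarized. It is not the MATILDE implementation: the function names `adjacent_same_block_prob` and `blocks_from_threshold`, the toy `samples` data, and the rule of cutting a block wherever the probability for two adjacent SNPs falls below the threshold are all assumptions made for this example.

```python
from typing import List


def adjacent_same_block_prob(samples: List[List[int]]) -> List[float]:
    """Fraction of sampled partitions in which SNP i and SNP i+1 share a block.

    Each sample is a list of block labels, one per SNP, in genomic order
    (hypothetical input format, assumed for this sketch).
    """
    n_snps = len(samples[0])
    counts = [0] * (n_snps - 1)
    for labels in samples:
        for i in range(n_snps - 1):
            if labels[i] == labels[i + 1]:
                counts[i] += 1
    return [c / len(samples) for c in counts]


def blocks_from_threshold(probs: List[float], threshold: float) -> List[List[int]]:
    """Group contiguous SNP indices into blocks.

    Adjacent SNPs stay in the same block when their estimated
    same-block probability is at least the user-chosen threshold.
    """
    blocks = [[0]]
    for i, p in enumerate(probs):
        if p >= threshold:
            blocks[-1].append(i + 1)   # extend the current block
        else:
            blocks.append([i + 1])     # start a new block
    return blocks


# Toy example: four sampled partitions of five contiguous SNPs.
samples = [
    [0, 0, 0, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1],
    [0, 0, 0, 0, 1],
]
probs = adjacent_same_block_prob(samples)
blocks = blocks_from_threshold(probs, threshold=0.6)
print(probs)   # [1.0, 0.75, 0.5, 0.75]
print(blocks)  # [[0, 1, 2], [3, 4]]
```

Raising the threshold yields finer partitions and lowering it yields coarser ones, which is the kind of user-controlled trade-off that point (c) describes.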