Mailund Thomas, Besenbacher Søren, Schierup Mikkel H
Department of Statistics, University of Oxford, UK.
BMC Bioinformatics. 2006 Oct 16;7:454. doi: 10.1186/1471-2105-7-454.
With current technology, vast amounts of data can be produced cheaply and efficiently in association studies. To prevent data analysis from becoming the bottleneck of such studies, fast and efficient analysis methods that scale to these data set sizes must be developed.
We present a fast method for accurate localisation of disease-causing variants in high-density case-control association mapping experiments with large numbers of cases and controls. The method searches for significant clustering of case chromosomes in the "perfect" phylogenetic tree defined by the largest region around each marker that is compatible with a single phylogenetic tree. This perfect phylogenetic tree is treated as a decision tree for determining disease status, and is scored by its accuracy as a decision tree. The rationale is that the perfect phylogeny near a disease-affecting mutation should provide more information about the affected/unaffected classification than random trees. If regions of compatibility contain few markers, for example because of large marker spacing, the algorithm can allow the inclusion of incompatible markers in order to enlarge the regions before estimating their phylogeny. Both haplotype data and phased genotype data can be analysed. The power and efficiency of the method are investigated on 1) simulated genotype data under different models of disease determination, 2) artificial data sets created from the HapMap resource, and 3) data sets used for testing other methods, to allow comparison with them. Our method has the same accuracy as single marker association (SMA) in the simplest case of a single disease-causing mutation and a constant recombination rate. However, in more complex scenarios of mutation heterogeneity and more complex haplotype structure, such as that found in the HapMap data, our method outperforms SMA as well as other fast data mining approaches such as HapMiner and Haplotype Pattern Mining (HPM), while also being significantly faster. For unphased genotype data, an initial step of estimating the phase only slightly decreases the power of the method.
The method was also found to accurately localise the known susceptibility variant in an empirical data set, the DeltaF508 mutation for cystic fibrosis, and to find significant signals for association between the CYP2D6 gene and poor drug metabolism, although for this data set the highest association score is about 60 kb from the CYP2D6 gene.
Our method has been implemented in the Blossoc (BLOck aSSOCiation) software. Using Blossoc, genome-wide chip-based surveys of 3 million SNPs in 1000 cases and 1000 controls can be analysed in less than two CPU hours.
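The region-finding step described above, the largest region around a marker that is compatible with a single phylogenetic tree, can be illustrated with a minimal sketch. It is not taken from Blossoc. It assumes binary (0/1) phased haplotypes, uses the standard four-gamete test as the pairwise compatibility criterion, and grows the region greedily to the left and right of the marker; the function names `compatible` and `compatible_region` and the greedy expansion order are assumptions made for illustration only.

```python
def compatible(haplotypes, i, j):
    """Four-gamete test: markers i and j fit one perfect phylogeny
    only if fewer than four distinct allele pairs are observed."""
    gametes = {(h[i], h[j]) for h in haplotypes}
    return len(gametes) < 4


def compatible_region(haplotypes, centre):
    """Greedily extend [left, right] around `centre` while every marker
    in the region stays pairwise compatible with every other one.
    The expansion order (left first, then right) is an assumption."""
    n_markers = len(haplotypes[0])
    left = right = centre
    grew = True
    while grew:
        grew = False
        for cand in (left - 1, right + 1):
            if not 0 <= cand < n_markers:
                continue
            # The candidate must pass the test against all current members.
            if all(compatible(haplotypes, cand, k)
                   for k in range(left, right + 1)):
                if cand < left:
                    left = cand
                else:
                    right = cand
                grew = True
    return left, right


# Toy data: four haplotypes (rows) typed at four binary markers (columns).
haps = [
    [0, 0, 0, 0],
    [0, 1, 0, 1],
    [1, 1, 0, 0],
    [1, 1, 1, 1],
]
region = compatible_region(haps, 1)
```

In the toy data, markers 0 and 3 show all four gametes, so no compatible region can contain both. A scoring step such as the decision-tree accuracy mentioned in the abstract would then operate on the tree built from the markers inside the returned region; that step is not sketched here because the abstract does not specify the score.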