Biostatistics and Computational Biology Branch, National Institute of Environmental Health Sciences, Research Triangle Park, Durham, NC, USA.
BMC Bioinformatics. 2018 Jan 2;19(1):2. doi: 10.1186/s12859-017-2004-2.
To evaluate statistical methods for genome-wide genetic analyses, one needs to be able to simulate realistic genotypes. Here we describe a method, applicable to a broad range of association study designs, that can simulate autosome-wide single-nucleotide polymorphism (SNP) data with realistic linkage disequilibrium and with spiked-in, user-specified, single-SNP or multi-SNP causal effects.
Our construction uses existing genome-wide association data from unrelated case-parent triads, augmented by including a hypothetical complement triad for each triad (same parents but with a hypothetical offspring who carries the non-transmitted parental alleles). We assign offspring qualitative or quantitative traits probabilistically through a specified risk model and show that our approach destroys the risk signals from the original data. Our method can simulate genetically homogeneous or stratified populations and can simulate case-parent studies, case-control studies, case-only studies, or studies of quantitative traits. We show that allele frequencies and linkage disequilibrium structure in the original genome-wide association sample are preserved in the simulated data. We have implemented our method in an R package (TriadSim), which is freely available from the Comprehensive R Archive Network (CRAN).
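The complement-triad idea and the probabilistic trait assignment can be illustrated with a minimal sketch. This is not the TriadSim implementation or its API. It assumes biallelic SNPs coded as minor-allele counts (0, 1, 2), and the function names, the logistic form of the risk model, and the toy genotypes are all hypothetical choices made for illustration.

```python
import numpy as np

def complement_genotypes(mother, father, child):
    """Genotypes of the hypothetical complement offspring.

    With genotypes coded as minor-allele counts, the two parents carry
    mother + father copies in total and the observed child received
    `child` of them, so the non-transmitted alleles sum to
    mother + father - child at each SNP.
    """
    mother = np.asarray(mother)
    father = np.asarray(father)
    child = np.asarray(child)
    comp = mother + father - child
    # Basic range check only; it does not catch every Mendelian error.
    if np.any((comp < 0) | (comp > 2)):
        raise ValueError("genotypes are inconsistent with transmission")
    return comp

def case_probability(allele_count, baseline_logit, log_odds_per_allele):
    """Hypothetical single-SNP logistic risk model (an assumption)."""
    logit = baseline_logit + log_odds_per_allele * np.asarray(allele_count)
    return 1.0 / (1.0 + np.exp(-logit))

# Toy triad at four SNPs (made-up values, not real data).
mother = np.array([1, 2, 0, 1])
father = np.array([1, 1, 0, 2])
child = np.array([2, 1, 0, 1])

complement = complement_genotypes(mother, father, child)

# Assign a binary trait to the child at random under the risk model,
# using the first SNP as the spiked-in causal variant.
rng = np.random.default_rng(0)
p_case = case_probability(child[0], baseline_logit=-2.0, log_odds_per_allele=0.5)
is_case = rng.random() < p_case
```

Because the trait is drawn from the specified model rather than taken from the original phenotype, a sketch like this carries no association beyond the effects that are deliberately spiked in.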
We have proposed a method for simulating genome-wide SNP data with realistic linkage disequilibrium. Our method will be useful for developing statistical methods for studying genetic associations, including higher-order effects such as epistasis and gene-by-environment interactions.