Lunenfeld-Tanenbaum Research Institute, Mount Sinai Hospital, 60, Murray Street, Toronto, ON, M5T 3L9, Canada.
Department of Statistical Sciences, University of Toronto, Toronto, M5S 3G3, Canada.
BMC Bioinformatics. 2019 Jan 15;20(1):26. doi: 10.1186/s12859-019-2611-1.
Simulated genetic variant data are frequently required to evaluate statistical methods in human and animal genetics. Although a number of high-quality genetic simulators have been developed, many of them require advanced knowledge of population genetics or of computation to be used effectively. In addition, generating simulated data for family-based studies demands sophisticated methods and advanced computer programming.
To address these issues, we propose a new user-friendly and integrated R package, sim1000G, which simulates variants in genomic regions among unrelated individuals or among families. The only input needed is a raw phased Variant Call Format (VCF) file. Haplotypes are extracted from this file to compute linkage disequilibrium (LD) in the simulated genomic regions and to generate new genotype data among unrelated individuals. The covariance across variants is used to preserve the LD structure of the original population. Pedigrees of arbitrary size are generated by modeling recombination events with sim1000G. To illustrate the application of sim1000G, various scenarios are presented, assuming unrelated individuals from a single population or from two distinct populations, or alternatively three-generation pedigree data. Sim1000G can capture allele frequency diversity, short- and long-range LD patterns, and subtle population differences in LD structure without requiring any tuning parameters.
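The abstract does not state how recombination is modeled when pedigrees are built. As a rough illustration only, the sketch below assumes crossovers without interference (Haldane's model, on average one crossover per 100 cM) and builds each transmitted haplotype as a mosaic of the two parental haplotypes. It is written in Python rather than R, and every name in it (`crossover_points`, `make_gamete`, `make_child`, `genotype`) is hypothetical and not part of the sim1000G API.

```python
import random


def crossover_points(length_cm, rng):
    """Draw crossover positions (in cM) along a region of the given length.

    Assumption: crossovers follow a Poisson process with a mean spacing
    of 100 cM (no interference).
    """
    points = []
    position = rng.expovariate(1 / 100.0)  # mean gap of 100 cM
    while position < length_cm:
        points.append(position)
        position += rng.expovariate(1 / 100.0)
    return points


def make_gamete(hap_a, hap_b, positions_cm, rng):
    """Return one transmitted haplotype, a mosaic of hap_a and hap_b.

    positions_cm holds the genetic position of each variant in cM,
    sorted ascending and measured from the start of the region.
    """
    breaks = crossover_points(positions_cm[-1], rng)
    parents = (hap_a, hap_b)
    current = rng.randrange(2)  # start on a randomly chosen parental haplotype
    next_break = 0
    gamete = []
    for index, pos in enumerate(positions_cm):
        # Switch haplotype once for every crossover located before this variant.
        while next_break < len(breaks) and breaks[next_break] < pos:
            current = 1 - current
            next_break += 1
        gamete.append(parents[current][index])
    return gamete


def make_child(mother, father, positions_cm, rng):
    """Each parent is a pair of haplotypes; the child receives one gamete from each."""
    return (
        make_gamete(mother[0], mother[1], positions_cm, rng),
        make_gamete(father[0], father[1], positions_cm, rng),
    )


def genotype(individual):
    """Count alternate alleles (0, 1 or 2) at each variant."""
    return [a + b for a, b in zip(*individual)]


if __name__ == "__main__":
    rng = random.Random(1)
    positions = [0.0, 40.0, 80.0, 120.0]  # made-up genetic positions in cM
    mother = ([0, 0, 1, 1], [1, 0, 0, 1])
    father = ([0, 1, 0, 1], [1, 1, 0, 0])
    print(genotype(make_child(mother, father, positions, rng)))
```

Applying `make_child` repeatedly, with children of one generation serving as parents of the next, would extend this toy model to a multi-generation pedigree. Founder haplotypes here are hard-coded, whereas the package derives them from the phased VCF input.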
Sim1000G fills a gap in the broad field of genetic variant simulators through its simplicity and its independence from external tools. Currently, it is one of the few simulation packages that is completely integrated into R and able to simulate multiple genetic variants among unrelated individuals and within families. Its implementation will facilitate the application and development of computational methods for association studies with both rare and common variants.