Center for Human Genetics Research, Vanderbilt University, Nashville, Tennessee, USA.
Adv Genet. 2010;72:1-24. doi: 10.1016/B978-0-12-380862-2.00001-1.
Simulating data is a necessary first step in evaluating new analytic methods because the true effects in simulated data are known. To successfully develop novel statistical and computational methods for genetic analysis, it is vital to simulate datasets consisting of single nucleotide polymorphisms (SNPs) spread throughout the genome at a density similar to that observed in new high-throughput molecular genomics studies. In addition, simulating environmental data and effects will be essential to properly formulate risk models for complex disorders. Data simulations are often criticized for being much less noisy than natural biological data, since it is nearly impossible to simulate the multitude of possible sources of natural and experimental variability. However, simulating data in silico is the most straightforward way to test the true potential of new methods during development. Thus, advances that increase the complexity of data simulations will permit investigators to better assess new analytical methods. In this work, we briefly describe some of the current approaches for simulating human genomics data and discuss the advantages and disadvantages of each. We also include details on software packages available for data simulation. Finally, we expand upon one particular approach for creating complex human genomic datasets that uses a forward-time population simulation algorithm: genomeSIMLA. Many of the hallmark features of biological datasets can be synthesized in silico; however, much research is still needed to enhance our ability to create datasets that capture the natural complexity of biological data.
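To make the idea of forward-time population simulation concrete, the following is a minimal sketch and not genomeSIMLA's actual algorithm. It assumes a constant-size, randomly mating diploid population, a handful of unlinked biallelic SNPs, and a single causal SNP whose effect is specified through a penetrance table. All function names, parameters, and numeric values are hypothetical and chosen only for illustration.

```python
import random


def initial_population(n_individuals, allele_freqs, rng):
    """Build diploid individuals as pairs of haplotypes (lists of 0/1 alleles)."""
    def haplotype():
        return [1 if rng.random() < f else 0 for f in allele_freqs]
    return [(haplotype(), haplotype()) for _ in range(n_individuals)]


def gamete(parent, rng):
    """Transmit one allele per locus; loci are treated as unlinked (assumption)."""
    return [rng.choice((a, b)) for a, b in zip(parent[0], parent[1])]


def next_generation(population, rng):
    """Advance one generation forward in time under random mating.

    Parents are drawn with replacement, so selfing is possible in this toy model.
    """
    return [
        (gamete(rng.choice(population), rng), gamete(rng.choice(population), rng))
        for _ in range(len(population))
    ]


def simulate(n_individuals, allele_freqs, generations, seed=0):
    """Run the forward-time simulation and return the final population and RNG."""
    rng = random.Random(seed)
    population = initial_population(n_individuals, allele_freqs, rng)
    for _ in range(generations):
        population = next_generation(population, rng)
    return population, rng


def assign_status(population, causal_locus, penetrance, rng):
    """Assign case (1) / control (0) status from a known penetrance table.

    penetrance[g] is the probability of disease given g copies (0, 1, or 2)
    of allele 1 at the causal locus, so the true effect is known by design.
    """
    statuses = []
    for hap_a, hap_b in population:
        g = hap_a[causal_locus] + hap_b[causal_locus]
        statuses.append(1 if rng.random() < penetrance[g] else 0)
    return statuses


if __name__ == "__main__":
    # Hypothetical settings: 200 individuals, 3 SNPs, 20 generations.
    pop, rng = simulate(200, [0.2, 0.5, 0.1], generations=20, seed=1)
    status = assign_status(pop, causal_locus=0, penetrance=(0.05, 0.10, 0.40), rng=rng)
    print("cases:", sum(status), "controls:", len(status) - sum(status))
```

Because the causal locus and penetrance table are chosen by the investigator, any analytic method applied to the resulting genotypes and statuses can be scored against a known truth. A realistic simulator would also need linkage and recombination between nearby SNPs, population growth, and environmental covariates, all of which this sketch omits.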