Ionita-Laza Iuliana, Laird Nan M
Columbia University, USA.
Stat Appl Genet Mol Biol. 2010;9(1):Article33. doi: 10.2202/1544-6115.1581. Epub 2010 Aug 27.
The recent emergence of massively parallel sequencing technologies has enabled an increasing number of human genome re-sequencing studies, notably the 1000 Genomes Project. The main aim of these studies is to identify as-yet-unknown genetic variants in a genomic region, mostly low-frequency variants (frequency less than 5%). We propose here a set of statistical tools that address how to optimally design such studies in order to increase the number of genetic variants we expect to discover. Within this framework, the tradeoff between lower coverage for more individuals and higher coverage for fewer individuals can be resolved naturally. The methods are also useful for estimating the number of genetic variants missed in a discovery study performed at low coverage. We show applications to simulated data based on coalescent models and to sequence data from the ENCODE project. In particular, we show the extent to which combining data from multiple populations in a discovery study may increase the number of genetic variants identified relative to studies on single populations.
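The coverage-versus-sample-size tradeoff mentioned above can be illustrated with a toy calculation. This is a minimal sketch under simplifying assumptions that are not taken from the paper: genotypes follow Hardy-Weinberg proportions, read depth at a site is Poisson distributed, a variant is called in an individual when at least `min_reads` reads carry it, and sequencing errors are ignored. The function names, the `budget` parameter (total depth summed over individuals), and the candidate depths are all hypothetical.

```python
import math


def poisson_tail(mean, k):
    """Return P(X >= k) for X ~ Poisson(mean)."""
    cdf = sum(math.exp(-mean) * mean**i / math.factorial(i) for i in range(k))
    return 1.0 - cdf


def detection_probability(freq, n_individuals, depth, min_reads=2):
    """Probability that a variant of allele frequency `freq` is seen in at
    least one of `n_individuals` diploid samples sequenced at mean `depth`.

    Assumed model (not from the paper): a heterozygote contributes variant
    reads at rate depth / 2, a homozygote at rate depth.
    """
    het = 2.0 * freq * (1.0 - freq) * poisson_tail(depth / 2.0, min_reads)
    hom = freq**2 * poisson_tail(depth, min_reads)
    per_individual = het + hom
    return 1.0 - (1.0 - per_individual) ** n_individuals


def best_design(freq, budget, depths=(1, 2, 4, 8, 16, 32), min_reads=2):
    """Among candidate depths, pick the (probability, n, depth) design that
    maximizes detection probability with n * depth <= budget.

    Assumes the budget covers at least one individual at some candidate depth.
    """
    designs = []
    for depth in depths:
        n = budget // depth  # individuals affordable at this depth
        if n >= 1:
            prob = detection_probability(freq, n, depth, min_reads)
            designs.append((prob, n, depth))
    return max(designs)


if __name__ == "__main__":
    for freq in (0.001, 0.01, 0.05):
        prob, n, depth = best_design(freq, budget=400)
        print(f"freq={freq}: n={n}, depth={depth}, P(detect)={prob:.3f}")
```

Under this toy model, scanning a few allele frequencies shows how the preferred split between depth and sample size shifts with how rare the target variants are. The statistical tools proposed in the paper address the design question more generally, so the sketch should be read only as an illustration of the tradeoff.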