Gravel Simon
Department of Human Genetics and Génome Québec Innovation Centre, McGill University, Montréal, Quebec H3A 0G1, Canada
Genetics. 2014 Jun;197(2):601-10. doi: 10.1534/genetics.114.162149. Epub 2014 Mar 17.
Successful sequencing experiments require judicious sample selection. However, this selection must often be performed on the basis of limited preliminary data. Predicting the statistical properties of the final sample based on preliminary data can be challenging, because numerous uncertain model assumptions may be involved. Here, we ask whether we can predict "omics" variation across many samples by sequencing only a fraction of them. In the infinite-genome limit, we find that a pilot study sequencing 5% of a population is sufficient to predict the number of genetic variants in the entire population within 6% of the correct value, using an estimator agnostic to demography, selection, or population structure. To reach similar accuracy in a finite genome with millions of polymorphisms, the pilot study would require ∼15% of the population. We present computationally efficient jackknife and linear programming methods that exhibit substantially less bias than the state of the art when applied to simulated data and subsampled 1000 Genomes Project data. Extrapolating based on the National Heart, Lung, and Blood Institute Exome Sequencing Project data, we predict that 7.2% of sites in the capture region would be variable in a sample of 50,000 African Americans and 8.8% in a European sample of equal size. Finally, we show how the linear programming method can also predict discovery rates of various genomic features, such as the number of transcription factor binding sites across different cell types.
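To make the prediction problem concrete, the sketch below shows the two directions involved. Going down in sample size is exact: given a site frequency spectrum, the expected number of variants seen in a smaller subsample follows from hypergeometric sampling. Going up requires an estimator. The one used here is the classical first-order jackknife from the species-richness literature, which estimates the total in an unbounded population. It is a stand-in for illustration, not the jackknife or linear programming method of this paper. The function names, the dictionary layout of the spectrum, and the toy counts are all assumptions made for the example.

```python
from math import comb


def expected_variants(sfs, total, n):
    """Expected number of variant sites seen in a random subsample of size n.

    sfs maps k -> number of sites whose variant allele is carried by exactly
    k of the `total` sampled chromosomes. A site counts as discovered when at
    least one carrier lands in the subsample (sampling without replacement).
    """
    if not 0 <= n <= total:
        raise ValueError("subsample size must lie between 0 and total")
    denom = comb(total, n)
    expected = 0.0
    for k, count in sfs.items():
        # Probability that all n draws come from the (total - k) non-carriers.
        p_missed = comb(total - k, n) / denom
        expected += count * (1.0 - p_missed)
    return expected


def first_order_jackknife(sfs, total):
    """Classical first-order jackknife estimate of the total variant count.

    Adds a correction proportional to the number of singletons (sites seen
    in exactly one chromosome) to the observed count. This is the textbook
    species-richness estimator, used here only as an illustration.
    """
    observed = sum(sfs.values())
    singletons = sfs.get(1, 0)
    return observed + singletons * (total - 1) / total


# Toy spectrum (hypothetical numbers): 10 chromosomes, 80 variant sites.
toy_sfs = {1: 50, 2: 20, 5: 10}
print(expected_variants(toy_sfs, total=10, n=5))
print(first_order_jackknife(toy_sfs, total=10))
```

The singleton-driven correction shows why rare variants dominate the extrapolation: the more sites are seen only once in the pilot, the more undiscovered variation the estimator predicts.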