Department of Preventive Medicine, Keck School of Medicine, USC, Los Angeles, California, USA.
Genet Epidemiol. 2012 Nov;36(7):696-709. doi: 10.1002/gepi.21664. Epub 2012 Aug 3.
Next-generation sequencing technology provides vast amounts of sequence data. It is more efficient and cheaper than previous sequencing technologies, but deep resequencing of entire samples is still expensive. Therefore, sensible strategies for choosing subsets of samples to sequence are required. Here we describe an algorithm for selecting a sub-sample of an existing sample with either of two goals in mind: maximizing the number of new polymorphic sites detected, or improving the efficiency with which the remaining unsequenced individuals can have their types imputed at newly discovered polymorphisms. We then describe a variation on our algorithm that focuses more on detecting rarer variants. We demonstrate the performance of our algorithm using simulated data and data from the 1000 Genomes Project.