Program in Computational Biology and Bioinformatics, Yale University, New Haven, Connecticut, USA.
Genet Epidemiol. 2012 Sep;36(6):549-60. doi: 10.1002/gepi.21648. Epub 2012 Jun 6.
Next-generation sequencing is widely used to study complex diseases because it can identify both common and rare variants without prior single nucleotide polymorphism (SNP) information. Pooled sequencing of implicated target regions can lower costs and allow more samples to be analyzed, thus improving statistical power to detect disease-associated variants. Several methods for disease association tests of pooled data and for optimal pooling designs have been developed under particular assumptions about the pooling process, for example, equal or unequal contributions to the pool, sequencing depth variation, and error rate. However, these simplified assumptions may not capture the many factors affecting pooled sequencing data quality, such as PCR amplification during target capture and sequencing and preferential bias toward the reference allele, among others. As a result, the properties of the observed data may differ substantially from those expected under the simplified assumptions. Here, we use real datasets from targeted sequencing of pooled samples, together with microarray SNP genotypes of the same subjects, to identify and quantify factors (biases and errors) affecting the observed sequencing data. Through simulations, we find that these factors have a significant impact on the accuracy of allele frequency estimation and the power of association tests. Furthermore, we develop a workflow protocol that incorporates these factors in data analysis to reduce the potential biases and errors in pooled sequencing data and to obtain better estimates of allele frequencies. The workflow, Psafe, is available at http://bioinformatics.med.yale.edu/group/.
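To illustrate how such factors can shift an allele frequency estimate, the following is a minimal sketch and not the Psafe workflow. It assumes a simple model in which each individual contributes a relative amount of DNA to the pool, alternate-allele reads are captured at a fixed ratio relative to reference reads, and bases are miscalled at a symmetric error rate. The function names and the parameters `weights`, `alt_capture_ratio`, and `error_rate` are hypothetical and chosen only for illustration.

```python
import random


def expected_alt_fraction(genotypes, weights=None,
                          alt_capture_ratio=1.0, error_rate=0.0):
    """Expected fraction of alternate-allele reads in a pool (assumed model).

    genotypes: alternate-allele count (0, 1, or 2) per diploid individual.
    weights: relative DNA contribution per individual (equal if None).
    alt_capture_ratio: capture efficiency of alternate reads relative to
        reference reads; values below 1 model reference-allele bias.
    error_rate: symmetric probability of miscalling ref as alt or alt as ref.
    """
    if weights is None:
        weights = [1.0] * len(genotypes)
    total = sum(weights)
    # True alternate-allele fraction in the pooled DNA.
    p = sum(w * g / 2 for w, g in zip(weights, genotypes)) / total
    # Reference bias: alternate reads are down- or up-weighted.
    q = p * alt_capture_ratio / (p * alt_capture_ratio + (1 - p))
    # Symmetric miscalls move reads between the two alleles.
    return q * (1 - error_rate) + (1 - q) * error_rate


def simulate_alt_reads(alt_fraction, depth, rng):
    """Draw the number of alternate reads at one site (binomial sampling)."""
    return sum(rng.random() < alt_fraction for _ in range(depth))


if __name__ == "__main__":
    rng = random.Random(1)  # fixed seed so the run is reproducible
    pool = [0, 1, 2, 0]     # toy genotypes; true alt frequency is 3/8
    ideal = expected_alt_fraction(pool)
    biased = expected_alt_fraction(pool, weights=[2, 1, 1, 1],
                                   alt_capture_ratio=0.8, error_rate=0.01)
    depth = 200
    print("expected, idealized pool:", ideal)
    print("expected, with assumed biases:", biased)
    print("one simulated estimate:",
          simulate_alt_reads(biased, depth, rng) / depth)
```

Under these assumptions, the naive estimate (alternate reads divided by depth) converges to the biased expectation rather than to the true pool frequency, so increasing the sequencing depth alone does not remove the discrepancy.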