Saw Swee Hock School of Public Health, National University of Singapore, Singapore 117597, Life Sciences Institute, National University of Singapore, Singapore 117456, Department of Statistics and Applied Probability, National University of Singapore, Singapore 117546, NUS Graduate School for Integrative Science and Engineering, National University of Singapore, Singapore 117456 and Genome Institute of Singapore, Agency for Science, Technology and Research, Singapore 138672.
Bioinformatics. 2014 Jun 15;30(12):1707-13. doi: 10.1093/bioinformatics/btu067. Epub 2014 Feb 19.
Whole-genome sequencing (WGS) is now routinely used for the detection and identification of genetic variants, particularly single nucleotide polymorphisms (SNPs) in humans, and this has provided valuable new insights into human diversity, population histories and genetic association studies of traits and diseases. However, this relies on accurate detection and genotype calling of the polymorphisms present in the sequenced samples. To minimize cost, the majority of current WGS studies, including the 1000 Genomes Project (1KGP), have adopted low-coverage sequencing of a large number of samples, and such designs have inadvertently influenced the development of variant calling methods for WGS data. Assessments of variant accuracy are usually performed on the same set of low-coverage individuals or on a smaller number of deeply sequenced individuals. It is thus unclear how these variant calling methods would fare for a dataset of ∼100 samples, sequenced at various coverage depths, from a population that is not part of the 1KGP.
Using down-sampling of the sequencing reads obtained from the Singapore Sequencing Malay Project (SSMP), and a set of SNP calls from the same individuals genotyped on the Illumina Omni1-Quad array, we assessed the sensitivity of SNP detection, the accuracy of the genotype calls and the variant accuracy for six commonly used variant calling methods: GATK, SAMtools, Consensus Assessment of Sequence and Variation (CASAVA), VarScan, glfTools and SOAPsnp. The results indicate that at 5× coverage depth, the multi-sample callers GATK and SAMtools yield the best accuracy, particularly if the study samples are called together with a large number of individuals such as those from the 1000 Genomes Project. If study samples are sequenced at a high coverage depth such as 30×, CASAVA has the highest variant accuracy of the variant callers assessed.
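To make the evaluation design concrete, the following is a minimal Python sketch of how sequencing calls might be scored against array genotypes, and of the fraction of reads to keep when down-sampling. The abstract does not define its metrics, so the definitions here are assumptions: sensitivity is taken as the share of array variant sites that the caller also reports as variant, and genotype concordance as the share of sites present in both sets whose genotypes agree. The function names, the site keys and the toy genotypes are all hypothetical and are not taken from the study.

```python
from typing import Dict

# Genotypes are coded as the number of non-reference alleles (0, 1 or 2),
# keyed by a site label such as "chr1:100". This coding is an assumption.
Genotypes = Dict[str, int]


def sensitivity(array_gt: Genotypes, seq_gt: Genotypes) -> float:
    """Share of array variant sites that are also called variant from sequencing.

    A site missing from seq_gt is treated as not detected.
    """
    variant_sites = [site for site, gt in array_gt.items() if gt > 0]
    if not variant_sites:
        raise ValueError("array data contain no variant sites")
    detected = sum(1 for site in variant_sites if seq_gt.get(site, 0) > 0)
    return detected / len(variant_sites)


def genotype_concordance(array_gt: Genotypes, seq_gt: Genotypes) -> float:
    """Share of sites present in both call sets whose genotypes are identical."""
    shared = [site for site in array_gt if site in seq_gt]
    if not shared:
        raise ValueError("no sites are shared between the two call sets")
    agree = sum(1 for site in shared if array_gt[site] == seq_gt[site])
    return agree / len(shared)


def downsample_fraction(original_depth: float, target_depth: float) -> float:
    """Fraction of reads to retain so that mean depth drops to target_depth."""
    if original_depth <= 0 or not 0 < target_depth <= original_depth:
        raise ValueError("need 0 < target_depth <= original_depth")
    return target_depth / original_depth


# Hypothetical toy data, for illustration only.
array_calls: Genotypes = {"chr1:100": 1, "chr1:200": 2, "chr1:300": 0, "chr1:400": 1}
seq_calls: Genotypes = {"chr1:100": 1, "chr1:200": 1, "chr1:300": 0}

toy_sensitivity = sensitivity(array_calls, seq_calls)
toy_concordance = genotype_concordance(array_calls, seq_calls)
keep_fraction = downsample_fraction(30.0, 5.0)

print(f"sensitivity = {toy_sensitivity:.3f}")
print(f"genotype concordance = {toy_concordance:.3f}")
print(f"fraction of reads to keep for 30x -> 5x = {keep_fraction:.3f}")
```

In the toy data, the site `chr1:400` is variant on the array but absent from the sequencing calls, so it lowers sensitivity without affecting concordance, while `chr1:200` is detected but genotyped differently, so it lowers concordance without affecting sensitivity. Keeping the two measures separate in this way mirrors the distinction the abstract draws between SNP detection and genotype accuracy.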