School of Computer Science and Technology, Tianjin University, Tianjin Haihe Education Park, Tianjin, China.
Department of Computer Science, City University of Hong Kong, Kowloon Tong, Hong Kong.
Bioinformatics. 2018 Jun 15;34(12):2012-2018. doi: 10.1093/bioinformatics/bty059.
Haplotype information is essential to the complete description and interpretation of genomes, genetic diversity and genetic ancestry. New technologies can provide Single Molecular Sequencing (SMS) data that cover about 90% of positions over chromosomes. However, SMS data have a higher error rate than the 1% error rate of short reads. SNP calling and haplotype assembly using SMS reads are therefore very difficult, and most existing techniques do not work properly on SMS data.
In this paper, we develop a progressive approach for SNP calling and haplotype assembly that works very well for SMS data. Our method can handle more than 200 million non-N bases on Chromosome 1 with millions of reads and more than 100 blocks, each of which contains more than 2 million bases and more than 3K SNP sites on average. Experimental results show that the false discovery rate and false negative rate for our method are 15.7% and 11.0% on NA12878, and 16.5% and 11.0% on NA24385. Moreover, the overall switch errors for our method are 7.26 and 5.21, with an average of 3378 and 5736 SNP sites per block, on NA12878 and NA24385, respectively. Here, we demonstrate that SMS reads alone can generate a high-quality solution for both SNP calling and haplotype assembly.
Source code and results are available at https://github.com/guofeieileen/SMRT/wiki/Software.
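The evaluation measures named in the abstract (false discovery rate, false negative rate and switch errors) can be illustrated with a minimal sketch. The abstract does not give the exact definitions used in the paper, so the code below assumes the commonly used ones: FDR = FP / (TP + FP), FNR = FN / (TP + FN), and a switch error counted wherever the agreement between predicted and true phase flips between two consecutive heterozygous sites in a block. All function names and the string encoding of haplotypes are hypothetical and are not taken from the authors' software.

```python
def snp_calling_rates(called, truth):
    """Return (FDR, FNR) given called and true SNP positions.

    Assumed definitions: FDR = FP / (TP + FP), FNR = FN / (TP + FN).
    """
    called, truth = set(called), set(truth)
    false_pos = len(called - truth)   # called sites that are not true SNPs
    false_neg = len(truth - called)   # true SNPs that were missed
    fdr = false_pos / len(called) if called else 0.0
    fnr = false_neg / len(truth) if truth else 0.0
    return fdr, fnr


def switch_errors(predicted, truth):
    """Count switch errors within one haplotype block.

    predicted and truth are equal-length strings of '0'/'1' giving the
    allele assigned to the first haplotype at each heterozygous site
    (an illustrative encoding, assumed here).
    """
    if len(predicted) != len(truth):
        raise ValueError("haplotypes must cover the same sites")
    # agree[i] is True when the predicted phase matches the truth at site i.
    agree = [p == t for p, t in zip(predicted, truth)]
    # A switch error occurs wherever agreement flips between adjacent sites,
    # so a fully complemented haplotype counts as zero errors.
    return sum(1 for a, b in zip(agree, agree[1:]) if a != b)


def switch_error_rate(predicted, truth):
    """Switch errors divided by the number of adjacent site pairs."""
    n = len(predicted)
    return switch_errors(predicted, truth) / (n - 1) if n > 1 else 0.0
```

Under these assumed definitions, a predicted haplotype that is the exact complement of the truth has no switch errors, because the two haplotypes of a block are unordered and only changes of phase along the block are penalized.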