Progressive approach for SNP calling and haplotype assembly using single molecular sequencing data.

Author information

School of Computer Science and Technology, Tianjin University, Tianjin Haihe Education Park, Tianjin, China.

Department of Computer Science, City University of Hong Kong, Kowloon Tong, Hong Kong.

Publication information

Bioinformatics. 2018 Jun 15;34(12):2012-2018. doi: 10.1093/bioinformatics/bty059.

Abstract

MOTIVATION

Haplotype information is essential to the complete description and interpretation of genomes, genetic diversity and genetic ancestry. New technologies can provide Single Molecular Sequencing (SMS) data that cover about 90% of the positions on a chromosome. However, SMS data have a higher error rate than the 1% error rate of short reads, which makes SNP calling and haplotype assembly from SMS reads very difficult. Most existing technologies do not work properly for SMS data.

RESULTS

In this paper, we develop a progressive approach for SNP calling and haplotype assembly that works very well for SMS data. Our method can handle more than 200 million non-N bases on Chromosome 1 with millions of reads and more than 100 blocks, each of which contains, on average, more than 2 million bases and more than 3000 SNP sites. Experimental results show that the false discovery rate and false negative rate of our method are 15.7% and 11.0% on NA12878, and 16.5% and 11.0% on NA24385. Moreover, the overall switch error rates of our method are 7.26% and 5.21%, with an average of 3378 and 5736 SNP sites per block, on NA12878 and NA24385, respectively. Here, we demonstrate that SMS reads alone can generate a high-quality solution for both SNP calling and haplotype assembly.
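
For readers unfamiliar with the metrics quoted above, the following is a minimal Python sketch of how they are commonly computed. It assumes the standard definitions (false discovery rate = false positives / called sites, false negative rate = missed sites / true sites, switch error rate = phase flips between adjacent heterozygous sites / number of adjacent pairs); the abstract does not state the authors' exact formulas, and the function names `snp_calling_rates` and `switch_error_rate` are hypothetical, not taken from the authors' software.

```python
def snp_calling_rates(called, truth):
    """Return (FDR, FNR) for called vs. true SNP positions.

    Assumed definitions (not confirmed by the abstract):
      FDR = false positives / all called sites
      FNR = false negatives / all true sites
    """
    called, truth = set(called), set(truth)
    false_pos = len(called - truth)   # called but not in the truth set
    false_neg = len(truth - called)   # true sites that were missed
    fdr = false_pos / len(called) if called else 0.0
    fnr = false_neg / len(truth) if truth else 0.0
    return fdr, fnr


def switch_error_rate(predicted, truth):
    """Return the fraction of adjacent site pairs whose phase relation
    differs between the predicted and the true haplotype.

    Both arguments are equal-length strings of '0'/'1' describing one
    haplotype of a block at its heterozygous sites. Comparing the relation
    between neighbours makes the result independent of which of the two
    complementary haplotypes is reported.
    """
    if len(predicted) != len(truth):
        raise ValueError("haplotypes must cover the same sites")
    if len(predicted) < 2:
        return 0.0
    switches = sum(
        (predicted[i] == predicted[i + 1]) != (truth[i] == truth[i + 1])
        for i in range(len(predicted) - 1)
    )
    return switches / (len(predicted) - 1)


if __name__ == "__main__":
    # Toy data, invented purely for illustration.
    print(snp_calling_rates({100, 200, 300, 400}, {100, 200, 300, 500, 600}))
    print(switch_error_rate("00110", "00000"))
```

Under these assumed definitions, a switch error rate of 7.26% would mean that about 7 of every 100 adjacent SNP pairs in a block are phased inconsistently with the reference haplotype.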

AVAILABILITY AND IMPLEMENTATION

Source code and results are available at https://github.com/guofeieileen/SMRT/wiki/Software.
