Progressive approach for SNP calling and haplotype assembly using single molecular sequencing data.

Affiliations

School of Computer Science and Technology, Tianjin University, Tianjin Haihe Education Park, Tianjin, China.

Department of Computer Science, City University of Hong Kong, Kowloon Tong, Hong Kong.

Publication information

Bioinformatics. 2018 Jun 15;34(12):2012-2018. doi: 10.1093/bioinformatics/bty059.

DOI:10.1093/bioinformatics/bty059

PMID:29474523

Abstract

MOTIVATION

Haplotype information is essential to the complete description and interpretation of genomes, genetic diversity and genetic ancestry. New technologies can provide Single Molecular Sequencing (SMS) data that cover about 90% of positions over chromosomes. However, SMS data have a higher error rate than the 1% error rate of short reads, which makes SNP calling and haplotype assembly from SMS reads very difficult. Most existing technologies do not work properly on SMS data.

RESULTS

In this paper, we develop a progressive approach for SNP calling and haplotype assembly that works very well for SMS data. Our method can handle more than 200 million non-N bases on Chromosome 1 with millions of reads and more than 100 blocks, each of which contains more than 2 million bases and more than 3K SNP sites on average. Experimental results show that the false discovery rate and false negative rate of our method are 15.7% and 11.0% on NA12878, and 16.5% and 11.0% on NA24385. Moreover, the overall switch error rates of our method are 7.26% and 5.21%, with an average of 3378 and 5736 SNP sites per block, on NA12878 and NA24385, respectively. Here, we demonstrate that SMS reads alone can generate a high-quality solution for both SNP calling and haplotype assembly.
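The three metrics reported above can be illustrated with a minimal sketch. It assumes the commonly used definitions (false discovery rate = FP / (TP + FP), false negative rate = FN / (TP + FN), switch error rate = phase flips between adjacent heterozygous sites divided by the number of adjacent pairs). The abstract does not spell out the authors' exact evaluation procedure, so the function names and the input encoding here are hypothetical, not taken from the paper or its software.

```python
from typing import Sequence, Set, Tuple


def snp_calling_rates(called: Set[int], truth: Set[int]) -> Tuple[float, float]:
    """Return (FDR, FNR) for called SNP positions against a truth set.

    Assumed definitions: FDR = FP / (TP + FP), FNR = FN / (TP + FN).
    """
    tp = len(called & truth)   # called and truly variant
    fp = len(called - truth)   # called but not in the truth set
    fn = len(truth - called)   # in the truth set but missed
    fdr = fp / (tp + fp) if called else 0.0
    fnr = fn / (tp + fn) if truth else 0.0
    return fdr, fnr


def switch_error_rate(pred: Sequence[int], truth: Sequence[int]) -> float:
    """Return the switch error rate within one phased block.

    pred and truth hold the allele (0 or 1) assigned to the first haplotype
    at each heterozygous site, in genomic order (a hypothetical encoding).
    A switch is counted when agreement with the truth flips between two
    adjacent sites.
    """
    if len(pred) != len(truth):
        raise ValueError("pred and truth must cover the same sites")
    if len(pred) < 2:
        return 0.0
    agree = [p == t for p, t in zip(pred, truth)]
    switches = sum(agree[i] != agree[i + 1] for i in range(len(agree) - 1))
    return switches / (len(agree) - 1)


if __name__ == "__main__":
    fdr, fnr = snp_calling_rates({1, 2, 3, 4}, {2, 3, 4, 5, 6})
    print(f"FDR={fdr:.2f} FNR={fnr:.2f}")
    print(f"switch error rate={switch_error_rate([0, 1, 1, 0, 0], [0, 1, 0, 1, 1]):.2f}")
```

In this toy input, one of four calls is absent from the truth set and two of five true sites are missed, while the predicted haplotype flips phase once across four adjacent site pairs.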

AVAILABILITY AND IMPLEMENTATION

Source code and results are available at https://github.com/guofeieileen/SMRT/wiki/Software.


Similar articles

Joint haplotype assembly and genotype calling via sequential Monte Carlo algorithm.

BMC Bioinformatics. 2015 Jul 16;16:223. doi: 10.1186/s12859-015-0651-8.

Leveraging reads that span multiple single nucleotide polymorphisms for haplotype inference from sequencing data.

Bioinformatics. 2013 Sep 15;29(18):2245-52. doi: 10.1093/bioinformatics/btt386. Epub 2013 Jul 3.

NanoSNP: a progressive and haplotype-aware SNP caller on low-coverage nanopore sequencing data.

Bioinformatics. 2023 Jan 1;39(1). doi: 10.1093/bioinformatics/btac824.

Longshot enables accurate variant calling in diploid genomes from single-molecule long read sequencing.

Nat Commun. 2019 Oct 11;10(1):4660. doi: 10.1038/s41467-019-12493-y.

DCHap: A Divide-and-Conquer Haplotype Phasing Algorithm for Third-Generation Sequences.

IEEE/ACM Trans Comput Biol Bioinform. 2022 May-Jun;19(3):1277-1284. doi: 10.1109/TCBB.2020.3005673. Epub 2022 Jun 3.

HapCUT2: A Method for Phasing Genomes Using Experimental Sequence Data.

Methods Mol Biol. 2023;2590:139-147. doi: 10.1007/978-1-0716-2819-5_9.

The linkage method: a novel approach for SNP detection and haplotype reconstruction from a single diploid individual using next-generation sequence data.

Mol Biol Evol. 2013 Sep;30(9):2187-96. doi: 10.1093/molbev/mst103. Epub 2013 May 31.

Integrating read-based and population-based phasing for dense and accurate haplotyping of individual genomes.

Bioinformatics. 2019 Jul 15;35(14):i242-i248. doi: 10.1093/bioinformatics/btz329.

SNP calling by sequencing pooled samples.

BMC Bioinformatics. 2012 Sep 20;13:239. doi: 10.1186/1471-2105-13-239.

Cited by

Harnessing Multi-Omics and Predictive Modeling for Climate-Resilient Crop Breeding: From Genomes to Fields.

Genes (Basel). 2025 Jul 10;16(7):809. doi: 10.3390/genes16070809.

HPTAS: An Alignment-Free Haplotype Phasing Algorithm Focused on Allele-Specific Studies Using Transcriptome Data.

Int J Mol Sci. 2025 Jun 13;26(12):5700. doi: 10.3390/ijms26125700.

Single-cell whole-genome sequencing, haplotype analysis in prenatal diagnosis of monogenic diseases.

Life Sci Alliance. 2023 Feb 21;6(5). doi: 10.26508/lsa.202201761. Print 2023 May.

Gene-Based Testing of Interactions Using XGBoost in Genome-Wide Association Studies.

Front Cell Dev Biol. 2021 Dec 16;9:801113. doi: 10.3389/fcell.2021.801113. eCollection 2021.

Testing Gene-Gene Interactions Based on a Neighborhood Perspective in Genome-wide Association Studies.

Front Genet. 2021 Dec 8;12:801261. doi: 10.3389/fgene.2021.801261. eCollection 2021.

Detecting and phasing minor single-nucleotide variants from long-read sequencing data.

Nat Commun. 2021 May 24;12(1):3032. doi: 10.1038/s41467-021-23289-4.

PredAmyl-MLP: Prediction of Amyloid Proteins Using Multilayer Perceptron.

Comput Math Methods Med. 2020 Nov 20;2020:8845133. doi: 10.1155/2020/8845133. eCollection 2020.

Kernel Fusion Method for Detecting Cancer Subtypes via Selecting Relevant Expression Data.

Front Genet. 2020 Sep 10;11:979. doi: 10.3389/fgene.2020.00979. eCollection 2020.

scHaplotyper: haplotype construction and visualization for genetic diagnosis using single cell DNA sequencing data.

BMC Bioinformatics. 2020 Feb 1;21(1):41. doi: 10.1186/s12859-020-3381-5.

Application of different DNA extraction procedures, library preparation protocols and sequencing platforms: impact on sequencing results.

Heliyon. 2019 Nov 1;5(10):e02745. doi: 10.1016/j.heliyon.2019.e02745. eCollection 2019 Oct.

