Using expected sequence features to improve basecalling accuracy of amplicon pyrosequencing data.

Author information

Rask Thomas S, Petersen Bent, Chen Donald S, Day Karen P, Pedersen Anders Gorm

Affiliations

Department of Systems Biology, Center for Biological Sequence Analysis, Technical University of Denmark, Building 208, Kongens Lyngby, DK-2800, Denmark.

Division of Medical Parasitology, Department of Microbiology, New York University Langone Medical Center, 341 East 25th Street, New York, NY, 10010, USA.

Publication information

BMC Bioinformatics. 2016 Apr 22;17:176. doi: 10.1186/s12859-016-1032-7.

DOI:10.1186/s12859-016-1032-7

PMID:27102804

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC4841065/

Abstract

BACKGROUND

Amplicon pyrosequencing targets a known genetic region and thus inherently produces reads that are expected to have certain features, such as a conserved nucleotide sequence and, in the case of protein-coding DNA, an open reading frame. Pyrosequencing errors, which consist mainly of nucleotide insertions and deletions, are on the other hand likely to disrupt open reading frames. This inverse relationship between errors and expectations based on prior knowledge can be used to guide basecalling, i.e. the inference of nucleotide sequence from raw sequencing data.
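As a minimal illustration of the inverse relationship described above, the Python sketch below checks whether a read keeps an open reading frame in its expected frame. The function name, the toy sequences, and the requirement that the amplicon length be a multiple of three are assumptions made for this example; none of them is taken from Multipass.

```python
# Stop codons of the standard genetic code.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def has_open_reading_frame(seq: str, frame: int = 0) -> bool:
    """Return True if seq, read from `frame`, is a whole number of codons
    and contains no stop codon (toy definition, assumed for this sketch)."""
    coding = seq.upper()[frame:]
    if len(coding) % 3 != 0:
        return False  # an indel has shifted the reading frame
    codons = (coding[i:i + 3] for i in range(0, len(coding), 3))
    return not any(codon in STOP_CODONS for codon in codons)


# Hypothetical 18-nt in-frame amplicon: ATG GCT GAT AGC TTG GAA
reference = "ATGGCTGATAGCTTGGAA"
# Same read with one extra T, as a homopolymer overcall might produce.
with_insertion = "ATGGCTTGATAGCTTGGAA"

print(has_open_reading_frame(reference))
print(has_open_reading_frame(with_insertion))
```

The single inserted base both shifts the frame and exposes an in-frame TGA, which is why the absence of an open reading frame is a useful warning sign for an indel error.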

RESULTS

The new basecalling method described here, named Multipass, implements a probabilistic framework for working with the raw flowgrams obtained by pyrosequencing. For each sequence variant, Multipass calculates the nucleotide sequences of the several most likely candidates given the flowgram data, together with their likelihoods. This probabilistic approach enables integration of basecalling into a larger model where other parameters can be incorporated, such as the likelihood of observing a full-length open reading frame at the targeted region. We apply the method to 454 amplicon pyrosequencing data obtained from a malaria virulence gene family, where Multipass generates 20% more error-free sequences than current state-of-the-art methods and provides sequence characteristics that allow generation of a set of high-confidence error-free sequences.
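The idea of combining a flowgram likelihood with an expectation about the open reading frame can be sketched in a few lines of Python. This is a toy model under stated assumptions, not the Multipass implementation: the Gaussian noise model with a fixed `sigma`, the prior probability `P_ORF`, the helper names, and the two candidate sequences are all hypothetical. Only the cyclic TACG flow order is a general property of 454 sequencing.

```python
import math

FLOW_ORDER = "TACG"                    # cyclic nucleotide flow order used by 454
STOP_CODONS = {"TAA", "TAG", "TGA"}
P_ORF = 0.95                           # assumed prior that a true read keeps its ORF


def ideal_flows(seq):
    """Noise-free flow values (homopolymer lengths) that seq would produce."""
    if set(seq) - set(FLOW_ORDER):
        raise ValueError("sequence must contain only A, C, G, T")
    flows, pos, i = [], 0, 0
    while pos < len(seq):
        base = FLOW_ORDER[i % len(FLOW_ORDER)]
        run = 0
        while pos < len(seq) and seq[pos] == base:
            run += 1
            pos += 1
        flows.append(run)
        i += 1
    return flows


def log_likelihood(flowgram, seq, sigma=0.25):
    """Log-likelihood of the flowgram under an assumed Gaussian noise model."""
    ideal = ideal_flows(seq)
    if len(ideal) > len(flowgram):
        return -math.inf               # sequence needs more flows than observed
    ideal = ideal + [0] * (len(flowgram) - len(ideal))
    norm = math.log(sigma * math.sqrt(2 * math.pi))
    return sum(-0.5 * ((x - m) / sigma) ** 2 - norm
               for x, m in zip(flowgram, ideal))


def has_orf(seq):
    """Toy ORF check: whole number of codons and no stop codon in frame 0."""
    codons = [seq[i:i + 3] for i in range(0, len(seq), 3)]
    return len(seq) % 3 == 0 and not any(c in STOP_CODONS for c in codons)


def log_posterior(flowgram, seq):
    """Unnormalized log posterior: flowgram log-likelihood plus log ORF prior."""
    prior = P_ORF if has_orf(seq) else 1.0 - P_ORF
    return log_likelihood(flowgram, seq) + math.log(prior)


in_frame = "ATGGCTGATAGC"              # hypothetical candidate with an intact ORF
overcall = "ATGGCTTGATAGC"             # same read with one extra T

# Build a flowgram that matches in_frame except for one ambiguous T flow.
flows_a, flows_b = ideal_flows(in_frame), ideal_flows(overcall)
ambiguous = next(i for i, (a, b) in enumerate(zip(flows_a, flows_b)) if a != b)
flowgram = [float(x) for x in flows_a]
flowgram[ambiguous] = 1.55             # signal lies between a 1-mer and a 2-mer

candidates = [in_frame, overcall]
best_by_likelihood = max(candidates, key=lambda s: log_likelihood(flowgram, s))
best_by_posterior = max(candidates, key=lambda s: log_posterior(flowgram, s))
print("likelihood only:", best_by_likelihood)
print("with ORF prior: ", best_by_posterior)
```

In this toy case the flow value of 1.55 is slightly closer to 2 than to 1, so the likelihood alone favors the candidate with the extra T. Adding the assumed ORF prior shifts the choice to the in-frame candidate, which is the kind of correction the probabilistic framework is meant to allow.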

CONCLUSIONS

This novel method can be used to increase the accuracy of existing and future amplicon sequencing data, particularly where extensive prior knowledge is available about the obtained sequences, for example in analysis of the immunoglobulin VDJ region, where Multipass can be combined with a model for the known recombining germline genes. Multipass is available for Roche 454 data at http://www.cbs.dtu.dk/services/MultiPass-1.0, and the concept can potentially be implemented for other sequencing technologies as well.

Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/3e4d/4841065/63b9207b0075/12859_2016_1032_Fig1_HTML.jpg
