Normal and compound Poisson approximations for pattern occurrences in NGS reads.

Authors

Zhai Zhiyuan, Reinert Gesine, Song Kai, Waterman Michael S, Luan Yihui, Sun Fengzhu

Affiliation

School of Mathematics, Shandong University, Jinan, Shandong, China.

Publication

J Comput Biol. 2012 Jun;19(6):839-54. doi: 10.1089/cmb.2012.0029.

DOI:10.1089/cmb.2012.0029

PMID:22697250

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC3375642/

Abstract

Next generation sequencing (NGS) technologies are now widely used in many biological studies. In NGS, sequence reads are randomly sampled from the genome sequence of interest. Most computational approaches for NGS data first map the reads to the genome and then analyze the data based on the mapped reads. Since many organisms have unknown genome sequences and many reads cannot be uniquely mapped to the genomes even if the genome sequences are known, alternative analytical methods are needed for the study of NGS data. Here we suggest using word patterns to analyze NGS data. Word pattern counting (the study of the probabilistic distribution of the number of occurrences of word patterns in one or multiple long sequences) has played an important role in molecular sequence analysis. However, no studies are available on the distribution of the number of occurrences of word patterns in NGS reads. In this article, we build probabilistic models for the background sequence and the sampling process of the sequence reads from the genome. Based on the models, we provide normal and compound Poisson approximations for the number of occurrences of word patterns from the sequence reads, with bounds on the approximation error. The main challenge is to consider the randomness in generating the long background sequence, as well as in the sampling of the reads using NGS. We show the accuracy of these approximations under a variety of conditions for different patterns with various characteristics. Under realistic assumptions, the compound Poisson approximation seems to outperform the normal approximation in most situations. These approximate distributions can be used to evaluate the statistical significance of the occurrence of patterns from NGS data. The theory and the computational algorithm for calculating the approximate distributions are then used to analyze ChIP-Seq data using transcription factor GABP. 
Software is available online (www-rcf.usc.edu/~fsun/Programs/NGS_motif_power/NGS_motif_power.html). In addition, Supplementary Material can be found online (www.liebertonline.com/cmb).
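The setting described in the abstract (a random background sequence, reads sampled at random from it, and a word counted across the reads) can be illustrated with a small simulation. This is only a sketch under simplifying assumptions that are not taken from the article: the background is i.i.d. uniform over A, C, G, T, reads have a fixed length, and read start positions are uniform. All function names and parameters are hypothetical, and this is not the authors' NGS_motif_power software. It compares the simulated mean count with the expected count that follows from linearity of expectation under these assumptions.

```python
import random


def count_overlapping(text, word):
    """Count occurrences of `word` in `text`, allowing overlaps."""
    count = 0
    start = text.find(word)
    while start != -1:
        count += 1
        start = text.find(word, start + 1)
    return count


def random_genome(length, rng, alphabet="ACGT"):
    """Assumed background model: i.i.d. uniform letters."""
    return "".join(rng.choice(alphabet) for _ in range(length))


def sample_reads(genome, n_reads, read_len, rng):
    """Assumed sampling model: fixed-length reads, uniform start positions."""
    last_start = len(genome) - read_len
    starts = (rng.randint(0, last_start) for _ in range(n_reads))
    return [genome[s:s + read_len] for s in starts]


def word_count_in_reads(reads, word):
    """Total number of (overlapping) occurrences of `word` over all reads."""
    return sum(count_overlapping(read, word) for read in reads)


def simulate_counts(word, genome_len, n_reads, read_len, n_trials, seed=0):
    """Redraw both the genome and the reads in every trial."""
    rng = random.Random(seed)
    counts = []
    for _ in range(n_trials):
        genome = random_genome(genome_len, rng)
        reads = sample_reads(genome, n_reads, read_len, rng)
        counts.append(word_count_in_reads(reads, word))
    return counts


def expected_count(word, n_reads, read_len, alphabet_size=4):
    """Expected count under the i.i.d. uniform assumption.

    Each read offers read_len - len(word) + 1 start positions, and each
    position matches with probability alphabet_size ** -len(word).
    """
    positions = read_len - len(word) + 1
    return n_reads * positions * alphabet_size ** (-len(word))


if __name__ == "__main__":
    counts = simulate_counts("ACGT", genome_len=5000, n_reads=100,
                             read_len=35, n_trials=200)
    mean = sum(counts) / len(counts)
    print("simulated mean:", mean)
    print("expected count:", expected_count("ACGT", 100, 35))
```

Because both the genome and the reads are redrawn in each trial, the spread of the simulated counts reflects the two sources of randomness that the abstract names as the main challenge. The empirical distribution of `counts` is what a normal or compound Poisson approximation would be compared against.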
