Zhai Zhiyuan, Reinert Gesine, Song Kai, Waterman Michael S, Luan Yihui, Sun Fengzhu
School of Mathematics, Shandong University, Jinan, Shandong, China.
J Comput Biol. 2012 Jun;19(6):839-54. doi: 10.1089/cmb.2012.0029.
Next-generation sequencing (NGS) technologies are now widely used in many biological studies. In NGS, sequence reads are randomly sampled from the genome sequence of interest. Most computational approaches for NGS data first map the reads to the genome and then analyze the data based on the mapped reads. Since many organisms have unknown genome sequences and many reads cannot be uniquely mapped to the genomes even if the genome sequences are known, alternative analytical methods are needed for the study of NGS data. Here we suggest using word patterns to analyze NGS data. Word pattern counting (the study of the probabilistic distribution of the number of occurrences of word patterns in one or multiple long sequences) has played an important role in molecular sequence analysis. However, no studies are available on the distribution of the number of occurrences of word patterns in NGS reads. In this article, we build probabilistic models for the background sequence and the sampling process of the sequence reads from the genome. Based on the models, we provide normal and compound Poisson approximations for the number of occurrences of word patterns from the sequence reads, with bounds on the approximation error. The main challenge is to consider the randomness in generating the long background sequence, as well as in the sampling of the reads using NGS. We show the accuracy of these approximations under a variety of conditions for different patterns with various characteristics. Under realistic assumptions, the compound Poisson approximation seems to outperform the normal approximation in most situations. These approximate distributions can be used to evaluate the statistical significance of the occurrence of patterns from NGS data. The theory and the computational algorithm for calculating the approximate distributions are then applied to ChIP-Seq data for the transcription factor GABP.
Software is available online (www-rcf.usc.edu/~fsun/Programs/NGS_motif_power/NGS_motif_power.html). In addition, Supplementary Material can be found online (www.liebertonline.com/cmb).
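The setting described in the abstract (a random background sequence, reads sampled at random from it, and a word counted across the reads) can be illustrated with a small simulation. The following is only a minimal sketch under simplifying assumptions that are not taken from the article: the background is i.i.d. uniform over A, C, G, T, reads have a fixed length and uniformly random start positions, and only one strand is considered. All function names and parameter values are hypothetical, and this is not the authors' software.

```python
import random
import statistics


def random_genome(length, rng, alphabet="ACGT"):
    """Draw an i.i.d. uniform background sequence (a simplifying assumption)."""
    return "".join(rng.choice(alphabet) for _ in range(length))


def sample_reads(genome, n_reads, read_len, rng):
    """Sample fixed-length reads at uniformly random start positions."""
    last_start = len(genome) - read_len
    starts = [rng.randint(0, last_start) for _ in range(n_reads)]
    return [genome[s:s + read_len] for s in starts]


def count_word(reads, word):
    """Count all occurrences of `word` in the reads, overlapping ones included."""
    k = len(word)
    total = 0
    for read in reads:
        for i in range(len(read) - k + 1):
            if read[i:i + k] == word:
                total += 1
    return total


def expected_count(n_reads, read_len, word, alphabet_size=4):
    """Mean count under the i.i.d. uniform model: reads x positions x P(word)."""
    k = len(word)
    return n_reads * (read_len - k + 1) * (1 / alphabet_size) ** k


def simulate_counts(word, n_rep, genome_len, n_reads, read_len, rng):
    """Redraw both the genome and the reads in every replicate and count the word."""
    counts = []
    for _ in range(n_rep):
        genome = random_genome(genome_len, rng)
        reads = sample_reads(genome, n_reads, read_len, rng)
        counts.append(count_word(reads, word))
    return counts


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the run is reproducible
    for word in ("ACGT", "AAAA"):
        counts = simulate_counts(word, n_rep=100, genome_len=5000,
                                 n_reads=200, read_len=36, rng=rng)
        print(word,
              "expected mean:", round(expected_count(200, 36, word), 2),
              "simulated mean:", round(statistics.mean(counts), 2),
              "simulated variance:", round(statistics.variance(counts), 2))
```

Comparing the simulated mean and variance for a self-overlapping word such as AAAA with those for a non-overlapping word such as ACGT gives an informal sense of why occurrences can arrive in clumps, which is the kind of situation in which a compound Poisson approximation is usually considered alongside a normal one. The sketch does not compute either approximation or the error bounds from the article.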