Zo Young-Gun, Colwell Rita R
Center of Marine Biotechnology, University of Maryland Biotechnology Institute, 701 E. Pratt Street, Baltimore, MD 21202, USA.
J Microbiol Methods. 2008 Feb;72(2):166-79. doi: 10.1016/j.mimet.2007.11.013. Epub 2007 Nov 23.
Sequences in public databases may contain sequencing errors. A double binomial model was previously developed to describe the distribution of the indel-excluded similarity coefficient (S) among repeatedly sequenced 16S rRNA; it yields a confidence interval for S that is useful for testing identity among sequences 400 bp in length. We characterized the sequencing errors found in nearly complete 16S rRNA sequences of Vibrionaceae: the reported sequence lengths were highly variable, and the sequences contained a small number of indels. To accommodate these characteristics, we derived from the double binomial model for S a simple binomial model for the distribution of a similarity coefficient (H) that includes indels. The model fit the empirical data well. Given a standard probability of base matching, either pre-determined or estimated by bootstrapping, the exact binomial test can be used to determine the relative level of sequencing error for a given pair of duplicated sequences. A limitation of the method is that duplicated sequences of the same template must be paired. This can be overcome by using only conserved regions of 16S rRNA sequences and pairing a given sequence with its highest-scoring BLAST hit in the GenBank nr database.
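The test described in the abstract can be illustrated with a minimal sketch. The abstract does not give the formulas, so the definitions here are assumptions: S is taken as matches divided by ungapped alignment columns, H as matches divided by all alignment columns (so each indel column counts against similarity), and the test as a one-sided lower-tail exact binomial test. The function names and the standard matching probability `p0 = 0.999` are hypothetical and are not taken from the paper.

```python
from math import exp, lgamma, log


def similarity_coefficients(a, b, gap="-"):
    """Return (matches, columns, S, H) for two aligned, equal-length sequences.

    Assumed definitions (not from the source):
      S = matches / ungapped columns   (indels excluded)
      H = matches / all columns        (indels counted as non-matches)
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    matches = gapped = columns = 0
    for x, y in zip(a, b):
        if x == gap and y == gap:
            continue  # a column with gaps in both sequences carries no information
        columns += 1
        if x == gap or y == gap:
            gapped += 1
        elif x == y:
            matches += 1
    ungapped = columns - gapped
    if ungapped == 0:
        raise ValueError("no ungapped columns to compare")
    return matches, columns, matches / ungapped, matches / columns


def binom_lower_tail(k, n, p):
    """Exact P(X <= k) for X ~ Binomial(n, p), summed in log space.

    Log space avoids overflow for n around 1500, the length of a
    nearly complete 16S rRNA sequence.
    """
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    total = 0.0
    for i in range(k + 1):
        log_pmf = (lgamma(n + 1) - lgamma(i + 1) - lgamma(n - i + 1)
                   + i * log(p) + (n - i) * log(1.0 - p))
        total += exp(log_pmf)
    return min(total, 1.0)


if __name__ == "__main__":
    p0 = 0.999  # hypothetical standard probability of base matching
    seq_a = "ACGT-ACGTTAGC"
    seq_b = "ACCTTACGTTAGC"
    matches, columns, s, h = similarity_coefficients(seq_a, seq_b)
    p_value = binom_lower_tail(matches, columns, p0)
    print(f"S={s:.4f} H={h:.4f} p={p_value:.3g}")
```

A small p-value suggests that the pair matches less often than the standard probability would predict, that is, a relatively high level of sequencing error in at least one of the two sequences. In practice `p0` would come from a reference set of duplicated sequences or from bootstrapping, as the abstract describes.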