
A simple binomial test for estimating sequencing errors in public repository 16S rRNA sequences.

Authors

Zo Young-Gun, Colwell Rita R

Affiliation

Center of Marine Biotechnology, University of Maryland Biotechnology Institute, 701 E. Pratt Street, Baltimore, MD 21202, USA.

Publication information

J Microbiol Methods. 2008 Feb;72(2):166-79. doi: 10.1016/j.mimet.2007.11.013. Epub 2007 Nov 23.

Abstract

Sequences in public databases may contain sequencing errors. A double binomial model describing the distribution of indel-excluded similarity coefficients (S) among repeatedly sequenced 16S rRNA was developed previously; it yields a confidence interval for S that is useful for testing sequence identity among sequences 400 bp in length. We characterized the sequencing errors found in nearly complete 16S rRNA sequences of Vibrionaceae: the sequences were highly variable in reported length and contained a small number of indels. To accommodate these characteristics, a simple binomial model for the distribution of a similarity coefficient (H) that includes indels was derived from the double binomial model for S. The model showed a good fit to empirical data. Using either a predetermined or a bootstrap-estimated standard probability of base matching, we were able to apply the exact binomial test to determine the relative level of sequencing error for a given pair of duplicate sequences. A limitation of the method is that duplicate sequences of the same template must be paired, but this can be overcome by using only conserved regions of 16S rRNA sequences and pairing a given sequence with its highest-scoring BLAST hit from the GenBank nr database.
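The exact binomial test described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract gives no formula for H, so the sketch assumes H is the fraction of aligned columns that match, with gap columns counted as mismatches. The function names, the toy alignment, and the standard matching probability of 0.999 are all hypothetical, chosen only for illustration.

```python
from math import exp, lgamma, log


def similarity_with_indels(seq_a: str, seq_b: str) -> tuple[int, int]:
    """Return (matching columns, total columns) for two aligned sequences.

    Assumption: a column containing a gap ('-') counts as a mismatch, so
    indels lower the similarity instead of being excluded.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    matches = sum(1 for a, b in zip(seq_a, seq_b) if a == b and a != "-")
    return matches, len(seq_a)


def binom_lower_tail(k: int, n: int, p: float) -> float:
    """Exact P(X <= k) for X ~ Binomial(n, p), with 0 < p < 1.

    Each term is computed in log space so that large n (for example,
    about 1500 columns for a nearly complete 16S rRNA gene) does not overflow.
    """
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    total = 0.0
    for i in range(k + 1):
        log_pmf = (
            lgamma(n + 1) - lgamma(i + 1) - lgamma(n - i + 1)
            + i * log(p) + (n - i) * log(1.0 - p)
        )
        total += exp(log_pmf)
    return min(total, 1.0)


if __name__ == "__main__":
    # Hypothetical toy alignment of two duplicate sequences.
    matches, columns = similarity_with_indels("ACGT-ACGTA", "ACGTTACGTA")
    h = matches / columns
    # Hypothetical standard probability of base matching (not from the paper).
    p_standard = 0.999
    p_value = binom_lower_tail(matches, columns, p_standard)
    print(f"H = {h:.3f}, one-sided p-value = {p_value:.4g}")
```

Under these assumptions, a small one-sided p-value suggests that the pair matches less often than the standard probability would predict, that is, a relatively high level of sequencing error in at least one of the two sequences.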
