Effect of the mutation rate and background size on the quality of pathogen identification.

Authors

Reed Chris, Fofanov Viacheslav, Putonti Catherine, Chumakov Sergei, Slezak Tom, Fofanov Yuriy

Affiliation

Department of Computer Science, University of Houston, 501 Philip G. Hoffman Hall, Houston, TX 77204, USA.

Publication information

Bioinformatics. 2007 Oct 15;23(20):2665-71. doi: 10.1093/bioinformatics/btm420. Epub 2007 Sep 19.

Abstract

MOTIVATION

Genomic-based methods have significant potential for fast and accurate identification of organisms, or even genes of interest, in complex environmental samples (air, water, soil, food, etc.), especially when the target organism cannot be isolated for a variety of reasons. Despite this potential, the presence of unknown, variable and usually large quantities of background DNA can cause interference, resulting in false-positive outcomes.

RESULTS

To estimate how the genomic diversity of the background (the total length of all the different genomes present in it), the target length and the target mutation rate affect the probability of misidentification, we introduce a mathematical definition of the quality of an individual signature in the presence of a background. The definition is based on the signature's length and the number of mismatches needed to transform the signature into the closest subsequence present in the background. In conjunction with a probabilistic framework, it allows one to predict the minimal signature length required to identify the target against backgrounds of different sizes, as well as the effect of the target's mutation rate on the quality of its identification. The model assumptions and predictions were validated using both Monte Carlo simulations and real genomic data. The proposed model can be used to determine appropriate signature lengths for various combinations of target and background genome sizes. It also predicts that no genomic signature will be able to identify a target whose mutation rate is >5%.
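
The abstract does not give the model's formulas, so the following is only a minimal illustrative sketch of the two ideas it names: the number of mismatches separating a signature from its closest background subsequence, and a simple probabilistic estimate of how often a background would contain a near-match by chance. The function names `min_mismatches` and `expected_near_matches` are hypothetical, and the assumption of independent, uniformly distributed nucleotides is a simplification introduced here, not necessarily the one used in the paper.

```python
from math import comb


def min_mismatches(signature: str, background: str) -> int:
    """Smallest Hamming distance between `signature` and any equally long
    substring of `background` (hypothetical helper, not from the paper)."""
    n = len(signature)
    if n == 0 or len(background) < n:
        raise ValueError("background must be at least as long as a non-empty signature")
    # Slide a window of length n over the background and count mismatches.
    return min(
        sum(a != b for a, b in zip(signature, background[i:i + n]))
        for i in range(len(background) - n + 1)
    )


def expected_near_matches(sig_len: int, background_len: int, max_mismatches: int) -> float:
    """Expected number of background windows within `max_mismatches` of a
    signature, ASSUMING i.i.d. uniform A/C/G/T (a simplification made here)."""
    windows = max(background_len - sig_len + 1, 0)
    # Per-window probability of at most `max_mismatches` mismatches:
    # each position mismatches with probability 3/4 under the assumption above.
    p_window = sum(
        comb(sig_len, k) * 0.75 ** k * 0.25 ** (sig_len - k)
        for k in range(max_mismatches + 1)
    )
    return windows * p_window
```

Under these assumptions, one could scan candidate signature lengths and keep the shortest one for which `expected_near_matches` falls below a chosen tolerance for a given background size; the tolerance itself would be a user choice rather than a value taken from the paper.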

SUPPLEMENTARY INFORMATION

Supplementary data are available at Bioinformatics online.
