National Center for Biotechnology Information, National Library of Medicine, Bethesda, MD 20894, USA and Computational Biology Research Center, National Institute of Advanced Industrial Science and Technology, Tokyo 135-0064, Japan.
Bioinformatics. 2014 Dec 15;30(24):3575-82. doi: 10.1093/bioinformatics/btu576. Epub 2014 Aug 28.
The alignment of DNA sequences to proteins, allowing for frameshifts, is a classic method in sequence analysis. It can help identify pseudogenes (which accumulate mutations), analyze raw DNA and RNA sequence data (which may have frameshift sequencing errors), investigate ribosomal frameshifts, etc. Often, however, only ad hoc approximations or simulations are available to provide the statistical significance of a frameshift alignment score.
We describe a method to estimate statistical significance of frameshift alignments, similar to classic BLAST statistics. (BLAST presently does not permit its alignments to include frameshifts.) We also illustrate the continuing usefulness of frameshift alignment with two 'post-genomic' applications: (i) when finding pseudogenes within the human genome, frameshift alignments show that most anciently conserved non-coding human elements are recent pseudogenes with conserved ancestral genes; and (ii) when analyzing metagenomic DNA reads from polluted soil, frameshift alignments show that most alignable metagenomic reads contain frameshifts, suggesting that metagenomic analysis needs to use frameshift alignment to derive accurate results.
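For orientation, the classic BLAST statistics that the method is compared to can be sketched as follows. This is the standard Karlin-Altschul form, E = K·m·n·exp(−λS), and not the frameshift-specific estimation procedure of the paper, which the abstract does not spell out. The function names and the numeric values of `lam` and `K` below are illustrative assumptions only.

```python
import math

def evalue(score, m, n, lam, K):
    """Expected number of chance alignments with score >= `score`
    between sequences of lengths m and n (Karlin-Altschul form)."""
    return K * m * n * math.exp(-lam * score)

def bit_score(score, lam, K):
    """Normalized score in bits, so that E = m * n * 2**(-bit_score)."""
    return (lam * score - math.log(K)) / math.log(2)

if __name__ == "__main__":
    # Placeholder parameters for illustration; they are NOT taken from the paper.
    lam, K = 0.267, 0.041
    m, n = 300, 1_000_000   # hypothetical query length and database length
    s = 80                  # hypothetical raw alignment score
    print(f"bit score = {bit_score(s, lam, K):.1f}")
    print(f"E-value   = {evalue(s, m, n, lam, K):.3g}")
```

In this form, significance depends on the scoring scheme only through λ and K, so extending BLAST-like statistics to frameshift alignments amounts to obtaining suitable values of these parameters for the frameshift scoring scheme.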