Finding errors in DNA sequences.

Authors

Posfai J, Roberts R J

Affiliation

Institute of Biophysics, Hungarian Academy of Sciences, Szeged.

Publication information

Proc Natl Acad Sci U S A. 1992 May 15;89(10):4698-702. doi: 10.1073/pnas.89.10.4698.

PMID:1316617

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC49150/

Abstract

An algorithm is described that can detect certain errors within coding regions of DNA sequences. The algorithm is based on the idea that an insertion or deletion error within a coding sequence would interrupt the reading frame and cause the correct translation of a DNA sequence to require one or more frameshifts. If the coding sequence shows similarity to a known protein sequence, then such errors can be detected by comparing the conceptual translations of DNA sequences in all six reading frames with every sequence in a protein sequence database. We have incorporated these ideas into a computer program, called DETECT, that can serve as an aid to the experimentalist who is determining new DNA sequences so that obvious errors may be located and corrected. The program has been tested using raw experimental data and against sequences from the European Molecular Biology Laboratory database that are annotated as containing frameshifts. We have also tested it using unidentified open reading frames that flank known, annotated genes in the GenBank database. Many potential errors are apparent, and in some cases functions can be suggested for the "corrected" versions of these reading frames, leading to the identification of new genes. As more sequences are determined, the power of this method will increase substantially.
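The core idea in the abstract (translate in all six reading frames, then look for a known protein whose matches jump from one frame to another) can be illustrated with a minimal Python sketch. This is not the DETECT program: the function names are hypothetical, the standard genetic code is assumed, and exact k-mer matching stands in for the similarity search against a protein database that the abstract describes.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (assumed here).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna, offset):
    """Conceptually translate dna starting at offset (0, 1 or 2)."""
    codons = (dna[i:i + 3] for i in range(offset, len(dna) - 2, 3))
    # Codons with ambiguous bases are rendered as 'X'.
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_translations(dna):
    """Map frame names '+1'..'+3' and '-1'..'-3' to translations."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq, offset)
    return frames


def frames_matching(dna, protein, k=4):
    """For each frame, list start positions in `protein` whose k-mer
    occurs exactly in that frame's translation (a crude similarity proxy)."""
    hits = {}
    for name, translated in six_frame_translations(dna).items():
        positions = [i for i in range(len(protein) - k + 1)
                     if protein[i:i + k] in translated]
        if positions:
            hits[name] = positions
    return hits


def suggests_frameshift(dna, protein, k=4):
    """True if the protein matches two or more frames of the same strand,
    which is the signature an insertion or deletion error would leave."""
    hits = frames_matching(dna, protein, k)
    return any(sum(1 for name in hits if name[0] == strand) >= 2
               for strand in "+-")
```

As a usage example, back-translating a short protein into DNA, deleting a single base in the middle, and calling `suggests_frameshift` on the result should report matches in two forward frames: the part before the deletion in one frame and the part after it in another. A practical tool would need tolerant, scored alignment rather than exact k-mers, since real homologs rarely match residue for residue.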
