Posfai J, Roberts R J
Institute of Biophysics, Hungarian Academy of Sciences, Szeged.
Proc Natl Acad Sci U S A. 1992 May 15;89(10):4698-702. doi: 10.1073/pnas.89.10.4698.
An algorithm is described that can detect certain errors within coding regions of DNA sequences. The algorithm is based on the idea that an insertion or deletion error within a coding sequence would interrupt the reading frame and cause the correct translation of a DNA sequence to require one or more frameshifts. If the coding sequence shows similarity to a known protein sequence, then such errors can be detected by comparing the conceptual translations of DNA sequences in all six reading frames with every sequence in a protein sequence database. We have incorporated these ideas into a computer program, called DETECT, that can serve as an aid to the experimentalist who is determining new DNA sequences, so that obvious errors may be located and corrected. The program has been tested using raw experimental data and against sequences from the European Molecular Biology Laboratory database that are annotated as containing frameshifts. We have also tested it using unidentified open reading frames that flank known, annotated genes in the GenBank database. Many potential errors are apparent, and in some cases functions can be suggested for the "corrected" versions of these reading frames, leading to the identification of new genes. As more sequences are determined, the power of this method will increase substantially.
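The core idea can be illustrated with a minimal Python sketch. This is not the DETECT program, and the abstract does not describe its implementation: the function names (`six_frame_translations`, `frame_path`), the word length `k`, and the use of exact substring matching in place of real similarity scoring are all assumptions made for illustration. The sketch translates a DNA sequence in all six reading frames, then walks along a known protein and records which frame each short stretch is found in. A change of frame partway through hints at an insertion or deletion.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons enumerated in TCAG order; '*' marks a stop codon.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons only; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translations(dna):
    """Map frame labels '+1'..'+3' and '-1'..'-3' to conceptual translations."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


def frame_path(dna, protein, k=4):
    """List the frames in which successive k-residue words of `protein` occur.

    Hypothetical helper: exact word matching stands in for a similarity
    search. More than one frame in the result suggests a frameshift.
    """
    frames = six_frame_translations(dna)
    path = []
    for i in range(len(protein) - k + 1):
        word = protein[i:i + k]
        matching = [name for name, pep in frames.items() if word in pep]
        # Stay in the current frame while it still matches; otherwise switch.
        if matching and (not path or path[-1] not in matching):
            path.append(matching[0])
    return path


# Toy data (invented for this example, not taken from the paper).
protein = "MKTAYIAKQR"
clean = "ATGAAAACCGCGTATATTGCGAAACAGCGT"
shifted = clean[:15] + "G" + clean[15:]  # simulate a one-base insertion error

print(frame_path(clean, protein))    # ['+1']
print(frame_path(shifted, protein))  # ['+1', '+2']
```

In the shifted sequence, the first part of the protein is found in frame +1 and the remainder in frame +2, which locates the simulated insertion between the two matched regions. A practical tool would need tolerant similarity scoring and a search over a whole protein database rather than a single known protein.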