Comparison of DNA sequences with protein sequences.

Author information

Pearson W R, Wood T, Zhang Z, Miller W

Affiliation

Department of Biochemistry, University of Virginia, Charlottesville 22908, USA.

Publication information

Genomics. 1997 Nov 15;46(1):24-36. doi: 10.1006/geno.1997.4995.

DOI:10.1006/geno.1997.4995

PMID:9403055

Abstract

The FASTA package of sequence comparison programs has been expanded to include FASTX and FASTY, which compare a DNA sequence to a protein sequence database, translating the DNA sequence in three frames and aligning the translated DNA sequence to each sequence in the protein database, allowing gaps and frameshifts. Also new are TFASTX and TFASTY, which compare a protein sequence to a DNA sequence database, translating each sequence in the DNA database in six frames and scoring alignments with gaps and frameshifts. FASTX and TFASTX allow only frameshifts between codons, while FASTY and TFASTY allow substitutions or frameshifts within a codon. We examined the performance of FASTX and FASTY using different gap-opening, gap-extension, frameshift, and nucleotide substitution penalties. In general, FASTX and FASTY perform equivalently when query sequences contain 0-10% errors. We also evaluated the statistical estimates reported by FASTX and FASTY. These estimates are quite accurate, except when an out-of-frame translation produces a low-complexity protein sequence. We used FASTX to scan the Mycoplasma genitalium, Haemophilus influenzae, and Methanococcus jannaschii genomes for unidentified or misidentified protein-coding genes. We found at least 9 new protein-coding genes in the three genomes and at least 35 genes with potentially incorrect boundaries.
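The three-frame and six-frame translations mentioned in the abstract can be illustrated with a minimal Python sketch. This is not the FASTX/TFASTX implementation: the function names are hypothetical, the standard genetic code is assumed, and the alignment step itself (scoring gaps and frameshifts) is not shown.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order (assumed here).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Complement map for unambiguous bases; other characters pass through unchanged.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of a DNA string."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def translate_frame(dna, offset):
    """Translate one reading frame starting at offset 0, 1, or 2.

    Codons containing ambiguous bases become 'X'; stop codons become '*'.
    A trailing partial codon is ignored.
    """
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X")
        for i in range(offset, len(dna) - 2, 3)
    )


def three_frame_translation(dna):
    """Translate the forward strand in all three frames (DNA query vs. protein database)."""
    return [translate_frame(dna, offset) for offset in range(3)]


def six_frame_translation(dna):
    """Translate both strands in three frames each (protein query vs. DNA database)."""
    return three_frame_translation(dna) + three_frame_translation(reverse_complement(dna))


if __name__ == "__main__":
    for frame, protein in enumerate(six_frame_translation("ATGGCCTAA"), start=1):
        print(frame, protein)
```

A single inserted or deleted base moves the correct translation from one of these frames to another, which is why an aligner that tolerates frameshifts must be able to switch frames partway through an alignment.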
