Anderson I, Brass A
School of Biological Sciences, University of Manchester, 2.205 Stopford Building, Oxford Road, Manchester M13 9PT, UK.
Bioinformatics. 1998;14(4):349-56. doi: 10.1093/bioinformatics/14.4.349.
Searching DNA sequences against a DNA database is an essential element of sequence analysis. However, few systematic studies have been carried out to determine when a match between two DNA sequences is biologically significant, and this limits the use that can be made of DNA searching algorithms.
A test set of DNA sequences has been constructed consisting of artificially evolved and real sequences. This set has been used to test various database searching algorithms (BLAST, BLAST2, FASTA and Smith-Waterman) on a subset of the EMBL database. The results of this analysis have been used to determine the sensitivity and coverage of all of the algorithms. Guidelines have been produced which can be used to assess the significance of DNA database search results. The Smith-Waterman algorithm was shown to have the best coverage, but the worst sensitivity, whereas the default BLASTN algorithm (word length set to 11) was shown to have good sensitivity, but poor coverage. A sensible compromise between speed, sensitivity and coverage can be obtained using either the FASTA or BLAST (word length set to 6) algorithms. However, analysis of the results also showed that no algorithm works well when the length of the probe sequence is <200 bases. In general, matches can accurately be identified between coding regions of DNA sequences when there is >35% sequence identity between the corresponding proteins. Searching a DNA sequence against a DNA sequence database can, therefore, be a useful tool in sequence analysis.
The test sets used are available via anonymous ftp from mbisg2.sbc.man.ac.uk in the directory /pub/cabios/testdata/
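As an illustration of what an artificially evolved test sequence might look like, the following minimal Python sketch mutates a probe by random point substitutions and reports the resulting percent identity. It is not the authors' procedure: the abstract does not specify how the sequences were evolved, so the substitution-only model, the uniform per-base rate, and the names `evolve` and `percent_identity` are all assumptions made for this example.

```python
import random

BASES = "ACGT"


def evolve(seq, rate, rng):
    """Return a copy of seq with each base substituted with probability `rate`.

    Assumption: substitutions only (no insertions or deletions), with a
    uniform per-base rate. A substituted base always differs from the original.
    """
    out = []
    for base in seq:
        if rng.random() < rate:
            out.append(rng.choice([b for b in BASES if b != base]))
        else:
            out.append(base)
    return "".join(out)


def percent_identity(a, b):
    """Percentage of positions at which two equal-length sequences agree."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    if not a:
        raise ValueError("sequences must be non-empty")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the run is reproducible
    # Hypothetical 300-base probe, above the 200-base limit noted in the abstract.
    probe = "".join(rng.choice(BASES) for _ in range(300))
    mutant = evolve(probe, 0.2, rng)
    print(len(probe), round(percent_identity(probe, mutant), 1))
```

Because this model has no insertions or deletions, the two sequences stay aligned position by position and identity can be counted directly. Real sequence pairs would need an alignment step first.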