Searching DNA databases for similarities to DNA sequences: when is a match significant?

Author information

Anderson I, Brass A

Affiliation

School of Biological Sciences, University of Manchester, 2.205 Stopford Building, Oxford Road, Manchester M13 9PT, UK.

Publication information

Bioinformatics. 1998;14(4):349-56. doi: 10.1093/bioinformatics/14.4.349.

Abstract

MOTIVATION

Searching DNA sequences against a DNA database is an essential element of sequence analysis. However, few systematic studies have been carried out to determine when a match between two DNA sequences has biological significance and this is limiting the use that can be made of DNA searching algorithms.

RESULTS

A test set of DNA sequences has been constructed consisting of artificially evolved and real sequences. This set has been used to test various database searching algorithms (BLAST, BLAST2, FASTA and Smith-Waterman) on a subset of the EMBL database. The results of this analysis have been used to determine the sensitivity and coverage of all of the algorithms. Guidelines have been produced which can be used to assess the significance of DNA database search results. The Smith-Waterman algorithm was shown to have the best coverage, but the worst sensitivity, whereas the default BLASTN algorithm (word length set to 11) was shown to have good sensitivity, but poor coverage. A sensible compromise between speed, sensitivity and coverage can be obtained using either the FASTA or BLAST (word length set to 6) algorithms. However, analysis of the results also showed that no algorithm works well when the length of the probe sequence is <200 bases. In general, matches can accurately be identified between coding regions of DNA sequences when there is >35% sequence identity between the corresponding proteins. Searching a DNA sequence against a DNA sequence database can, therefore, be a useful tool in sequence analysis.
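The abstract contrasts BLASTN word lengths of 11 and 6. One plausible intuition for that trade-off, sketched below, is that word-based search methods start from short exact matches ("words") shared by the query and a database sequence, so a longer word length leaves fewer starting points once sequences have diverged. This is a toy illustration only, not the BLAST or FASTA implementation and not the authors' test procedure; the function names `words` and `shared_words` and both sequences are hypothetical.

```python
def words(seq: str, k: int) -> set[str]:
    """Return the set of all length-k substrings (words) of seq."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def shared_words(query: str, subject: str, k: int) -> set[str]:
    """Return the exact length-k words that occur in both sequences."""
    return words(query, k) & words(subject, k)


# Hypothetical 30-base sequences that differ by two substitutions
# (0-based positions 9 and 19), so no exact match run reaches 11 bases.
query = "ATGGCGTACCTGAAGCTTGACCGTAGGCTA"
subject = "ATGGCGTACTTGAAGCTTGGCCGTAGGCTA"

seeds_11 = shared_words(query, subject, 11)  # empty: no shared 11-base word
seeds_6 = shared_words(query, subject, 6)    # non-empty: shorter words survive

print(len(seeds_11), len(seeds_6))
```

In this toy case a word length of 11 finds no exact word to start from even though the sequences are clearly related, while a word length of 6 still does. Shorter words also match unrelated sequences by chance more often, which may be part of why the choice is a compromise rather than a clear win.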

AVAILABILITY

The test sets used are available via anonymous ftp from mbisg2.sbc.man.ac.uk in the directory /pub/cabios/testdata/

CONTACT

I.Anderson@stud.man.ac.uk; abrass@man.ac.uk
