A RAPID algorithm for sequence database comparisons: application to the identification of vector contamination in the EMBL databases.

Author information

Miller C, Gurd J, Brass A

Affiliation

School of Biological Sciences, 2.205 The Stopford Building, University of Manchester, Oxford Road, Manchester M13 9PT, UK.

Publication information

Bioinformatics. 1999 Feb;15(2):111-21. doi: 10.1093/bioinformatics/15.2.111.

Abstract

MOTIVATION

Word-matching algorithms such as BLAST are routinely used for sequence comparison. These algorithms typically use areas of matching words to seed alignments, which are then used to assess the degree of sequence similarity. In this paper, we show that by formally separating the word-matching and sequence-alignment processes, and using information about word frequencies to generate alignments and similarity scores, we can create a new sequence-comparison algorithm which is both fast and sensitive. The formal split between word searching and alignment allows users to select an appropriate alignment method without affecting the underlying similarity search. The algorithm has been used to develop software for identifying entries in DNA sequence databases which are contaminated with vector sequence.
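The idea of scoring similarity from word frequencies, independently of any alignment step, can be illustrated with a minimal sketch. This is not the RAPID algorithm itself, whose details the abstract does not give. The function names (`words`, `word_weights`, `similarity`) and the negative-log-frequency weighting are assumptions chosen only to show how rare shared words might count for more than common ones.

```python
import math
from collections import Counter


def words(seq, k):
    """Return all overlapping words (k-mers) of length k in seq."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def word_weights(database, k):
    """Weight each word by -log of its relative frequency in the database.

    Assumption for illustration: rare words are more informative, so they
    receive larger weights. The published method may weight differently.
    """
    counts = Counter(w for seq in database for w in words(seq, k))
    total = sum(counts.values())
    return {w: -math.log(c / total) for w, c in counts.items()}


def similarity(query, target, weights, k):
    """Sum the weights of the distinct words shared by query and target.

    No alignment is computed here; an alignment method could be applied
    afterwards to the pairs that score highly.
    """
    shared = set(words(query, k)) & set(words(target, k))
    return sum(weights.get(w, 0.0) for w in shared)


# Toy data (hypothetical sequences, not real vector or EST entries).
database = ["ACGTACGTAA", "TTTTTTTTTT", "ACGTTTGGCC"]
weights = word_weights(database, k=3)
score_related = similarity("ACGTAC", database[0], weights, k=3)
score_unrelated = similarity("ACGTAC", database[1], weights, k=3)
```

In this toy example the query shares four words with the first database entry and none with the second, so only the first pair would be passed on to a separate alignment step.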

RESULTS

We present three algorithms, RAPID, PHAT and SPLAT, which together allow vector contaminations to be found and assessed extremely rapidly. RAPID is a word search algorithm which uses probabilities to modify the significance attached to different words; PHAT and SPLAT are alignment algorithms. An initial implementation has been shown to be approximately an order of magnitude faster than BLAST. The formal split between word searching and alignment not only offers considerable gains in performance, but also allows alignment generation to be viewed as a user interface problem, allowing the most useful output method to be selected without affecting the underlying similarity search. Receiver Operator Characteristic (ROC) analysis of an artificial test set allows the optimal score threshold for identifying vector contamination to be determined. ROC curves were also used to determine the optimum word size (nine) for finding vector contamination. An analysis of the entire expressed sequence tag (EST) subset of EMBL found a contamination rate of 0.27%. A more detailed analysis of the 50 000 ESTs in est10.dat (an EST subset of EMBL) finds an error rate of 0.86%, principally due to two large-scale projects.
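Choosing a score threshold from an ROC analysis of a labelled test set can be sketched as follows. The abstract does not say which criterion was used to pick the optimum, so this example assumes Youden's J statistic (true-positive rate minus false-positive rate). The function names and the toy scores are hypothetical.

```python
def roc_points(scores, labels):
    """Return (threshold, tpr, fpr) for every distinct score.

    A sequence is called contaminated when its score >= threshold.
    labels are 1 for known contaminated entries and 0 for clean ones;
    both classes must be present in the test set.
    """
    pos = sum(labels)
    neg = len(labels) - pos
    points = []
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and not y)
        points.append((t, tp / pos, fp / neg))
    return points


def best_threshold(scores, labels):
    """Pick the threshold maximising tpr - fpr (Youden's J, an assumption)."""
    return max(roc_points(scores, labels), key=lambda p: p[1] - p[2])[0]


# Toy test set: three clean entries and three contaminated ones.
scores = [1, 2, 3, 8, 9, 10]
labels = [0, 0, 0, 1, 1, 1]
threshold = best_threshold(scores, labels)
```

The same procedure could be repeated for each candidate word size and the resulting curves compared, which is one way a parameter such as word size might be tuned.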

AVAILABILITY

A Web page for the software exists at http://bioinf.man.ac.uk/rapid, or it can be downloaded from ftp://ftp.bioinf.man.ac.uk/RAPID

CONTACT

crispin@cs.man.ac.uk
