Gotea Valer, Veeramachaneni Vamsi, Makałowski Wojciech
Institute of Molecular Evolutionary Genetics and Department of Biology, The Pennsylvania State University, 514 Mueller Lab, University Park, PA 16802, USA.
Nucleic Acids Res. 2003 Dec 1;31(23):6935-41. doi: 10.1093/nar/gkg886.
One of the most common activities in bioinformatics is the search for similar sequences. These searches are usually carried out with programs from the NCBI BLAST family. Because most searches are routinely performed with default parameters, a question that should be addressed is how reliable the results obtained with the default parameter values are, i.e. what fraction of potential matches these searches retrieve. Our primary focus is on the initial hit parameter, also known as the seed or word, used by NCBI BLASTn, MegaBLAST and other similar programs in searches for similar nucleotide sequences. We show that using the default values for the initial hit parameter can substantially reduce the proportion of potentially similar sequences that are retrieved. We also show how the hit probability of different seeds varies with the minimum length and similarity of the sequences to be retrieved, and we describe methods that help in determining appropriate seeds. The experimental results described in this paper illustrate the situations in which these methods are most applicable and also show the relationship between the various BLAST parameters.
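The dependence of seed hit probability on alignment length and similarity can be illustrated with a small calculation. The sketch below is not the authors' method; it assumes a simplified model in which the seed is a contiguous word of exact matches and every column of an ungapped alignment matches independently with probability equal to the sequence identity. The function name `seed_hit_probability` and the parameter values are hypothetical, and the word sizes 11 and 28 are used only as the commonly cited defaults for BLASTn and MegaBLAST, which the abstract itself does not state.

```python
def seed_hit_probability(length: int, identity: float, word_size: int) -> float:
    """Probability that an ungapped alignment of `length` columns contains
    at least one run of `word_size` consecutive matches.

    Assumption: each column matches independently with probability
    `identity` (a simplified model, not the actual BLAST implementation).
    """
    if word_size <= 0:
        raise ValueError("word_size must be positive")
    if not 0.0 <= identity <= 1.0:
        raise ValueError("identity must be between 0 and 1")
    if length < word_size:
        return 0.0

    # no_hit[k]: probability that no seed hit has occurred so far and the
    # alignment currently ends in a run of exactly k matches (k < word_size).
    no_hit = [0.0] * word_size
    no_hit[0] = 1.0

    for _ in range(length):
        nxt = [0.0] * word_size
        # A mismatch resets the current run to length 0.
        nxt[0] = sum(no_hit) * (1.0 - identity)
        # A match extends the current run by one column.
        for k in range(1, word_size):
            nxt[k] = no_hit[k - 1] * identity
        # Mass leaving state word_size - 1 via a match is a seed hit,
        # so it is dropped from the "no hit" distribution.
        no_hit = nxt

    return 1.0 - sum(no_hit)


if __name__ == "__main__":
    # Hypothetical scenario: 100-column alignments at several identities.
    for identity in (0.95, 0.85, 0.75):
        for word_size in (28, 11, 7):
            p = seed_hit_probability(100, identity, word_size)
            print(f"identity={identity:.2f} word={word_size:2d} hit probability={p:.4f}")
```

Under this model, a shorter word never lowers the hit probability for a given length and identity, which is one way to see why a long default seed may miss short or moderately diverged matches.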