Sun Yanni, Buhler Jeremy
Department of Computer Science and Engineering, Washington University, St. Louis, MO, USA.
BMC Bioinformatics. 2006 Mar 13;7:133. doi: 10.1186/1471-2105-7-133.
Seeded alignment is an important component of algorithms for fast, large-scale DNA similarity search. A good seed matching heuristic can reduce the execution time of genomic-scale sequence comparison without degrading sensitivity. Recently, many types of seed have been proposed to improve on the performance of traditional contiguous seeds as used in, e.g., NCBI BLASTN. Choosing among these seed types, particularly those that use information besides the presence or absence of matching residue pairs, requires practical guidance based on a rigorous comparison, including assessment of sensitivity, specificity, and computational efficiency. This work performs such a comparison, focusing on alignments in DNA outside widely studied coding regions.
We compare seeds of several types, including those that accept either a match or a transition mutation at certain fixed positions, those allowing transitions at arbitrary positions ("BLASTZ" seeds), and those using a more general scoring matrix. For each seed type, we use an extended version of our Mandala seed design software to choose seeds with optimized sensitivity for various levels of specificity. Our results show that, on a test set biased toward alignments of noncoding DNA, transition information significantly improves seed performance, while finer distinctions between different types of mismatches do not. BLASTZ seeds perform especially well. These results depend on properties of our test set that are not shared by EST-based test sets with a strong bias toward coding DNA.
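The difference between the first two seed types can be illustrated with a minimal Python sketch. It is not taken from Mandala or BLASTZ. The pattern notation (`1` for a required match, `T` for match-or-transition, `0` for a don't-care position), the function names, and the `max_transitions` parameter are all assumptions made for illustration. Seeds based on a general scoring matrix are not covered.

```python
# Illustrative sketch only: notation and names are hypothetical, not Mandala's.
# A transition is a substitution within the purines (A<->G) or within the
# pyrimidines (C<->T); any other mismatch is a transversion.
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(a: str, b: str) -> bool:
    """Return True if a and b differ but belong to the same base class."""
    return a != b and ({a, b} <= PURINES or {a, b} <= PYRIMIDINES)


def fixed_transition_hit(pattern: str, x: str, y: str) -> bool:
    """Seed that tolerates transitions only at fixed positions.

    Assumed pattern symbols:
      '1' = residues must match
      'T' = residues must match or form a transition
      '0' = don't-care position
    """
    if not (len(pattern) == len(x) == len(y)):
        raise ValueError("pattern and both windows must have equal length")
    for p, a, b in zip(pattern, x, y):
        if p == "1" and a != b:
            return False
        if p == "T" and a != b and not is_transition(a, b):
            return False
    return True


def arbitrary_transition_hit(pattern: str, x: str, y: str,
                             max_transitions: int = 1) -> bool:
    """Seed over '1'/'0' that tolerates transitions at any '1' position.

    Up to max_transitions (an assumed parameter) of the required positions
    may hold a transition instead of a match; a transversion at a required
    position always rejects the window.
    """
    if not (len(pattern) == len(x) == len(y)):
        raise ValueError("pattern and both windows must have equal length")
    transitions = 0
    for p, a, b in zip(pattern, x, y):
        if p != "1" or a == b:
            continue  # don't-care position or exact match
        if not is_transition(a, b):
            return False  # transversion at a required position
        transitions += 1
        if transitions > max_transitions:
            return False
    return True
```

For example, the windows `ACGTAC` and `ACATAC` differ only by a G/A transition at the third position. The fixed-position seed `11T111` accepts them and the strict seed `111111` does not, while `arbitrary_transition_hit("111111", ...)` accepts them wherever the single transition falls.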
Practical seed design requires careful attention to the properties of the alignments being sought. For noncoding DNA sequences, seeds that use transition information, especially BLASTZ-style seeds, are particularly useful. The Mandala seed design software can be found at http://www.cse.wustl.edu/~yanni/mandala/.