Mastering seeds for genomic size nucleotide BLAST searches.

Authors

Gotea Valer, Veeramachaneni Vamsi, Makałowski Wojciech

Affiliation

Institute of Molecular Evolutionary Genetics and Department of Biology, The Pennsylvania State University, 514 Mueller Lab, University Park, PA 16802, USA.

Publication information

Nucleic Acids Res. 2003 Dec 1;31(23):6935-41. doi: 10.1093/nar/gkg886.

DOI:10.1093/nar/gkg886

PMID:14627826

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC290255/

Abstract

One of the most common activities in bioinformatics is the search for similar sequences. These searches are usually carried out with the help of programs from the NCBI BLAST family. As the majority of searches are routinely performed with default parameters, a question that should be addressed is how reliable the results obtained using the default parameter values are, i.e. what fraction of potential matches have been retrieved by these searches. Our primary focus is on the initial hit parameter, also known as the seed or word, used by the NCBI BLASTn, MegaBLAST and other similar programs in searches for similar nucleotide sequences. We show that the use of default values for the initial hit parameter can have a big negative impact on the proportion of potentially similar sequences that are retrieved. We also show how the hit probability of different seeds varies with the minimum length and similarity of sequences desired to be retrieved and describe methods that help in determining appropriate seeds. The experimental results described in this paper illustrate situations in which these methods are most applicable and also show the relationship between the various BLAST parameters.
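
The dependence of hit probability on seed length, alignment length and similarity can be illustrated with a small calculation. The sketch below is not the method used in the paper; it assumes a simplified model in which an ungapped alignment has a fixed number of columns and each column matches independently with probability equal to the sequence identity. Under that assumption, a contiguous seed of size W "hits" when the alignment contains at least one run of W consecutive matches. The function name `seed_hit_probability` and its parameters are hypothetical, chosen for illustration only.

```python
def seed_hit_probability(length: int, identity: float, word_size: int) -> float:
    """Probability of at least one run of `word_size` consecutive matches
    in `length` alignment columns, assuming each column matches
    independently with probability `identity` (a simplifying assumption)."""
    if word_size <= 0 or length < 0 or not 0.0 <= identity <= 1.0:
        raise ValueError("invalid length, identity, or word_size")
    if word_size > length:
        return 0.0

    # state[k] = probability that no hit has occurred so far and the
    # alignment currently ends in exactly k consecutive matches.
    state = [0.0] * word_size
    state[0] = 1.0
    hit = 0.0

    for _ in range(length):
        nxt = [0.0] * word_size
        # A mismatch resets the current run of matches to zero.
        nxt[0] = sum(state) * (1.0 - identity)
        # A match extends the current run by one column.
        for k in range(word_size - 1):
            nxt[k + 1] = state[k] * identity
        # A match after word_size - 1 matches completes a seed hit.
        hit += state[word_size - 1] * identity
        state = nxt

    return hit


if __name__ == "__main__":
    # Word sizes 11 and 28 are the usual BLASTn and MegaBLAST defaults;
    # the lengths and identities below are arbitrary illustration values.
    for word_size in (11, 28):
        for length in (50, 100, 500):
            for identity in (0.80, 0.90, 0.95):
                p = seed_hit_probability(length, identity, word_size)
                print(f"W={word_size:2d} L={length:3d} id={identity:.2f} P(hit)={p:.4f}")
```

Under this model, a longer seed can only lower the hit probability for a given alignment length and identity, which is one way to see why a large default word size may miss short or divergent matches.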
