Price Morgan N, Dehal Paramvir S, Arkin Adam P
Physical Biosciences Division, Lawrence Berkeley National Laboratory, Berkeley, California, USA.
PLoS One. 2008;3(10):e3589. doi: 10.1371/journal.pone.0003589. Epub 2008 Oct 31.
BACKGROUND: All-versus-all BLAST, which searches for homologous pairs of sequences in a database of proteins, is used to identify potential orthologs, to find new protein families, and to provide rapid access to these homology relationships. As DNA sequencing accelerates and data sets grow, all-versus-all BLAST has become computationally demanding.
METHODOLOGY/PRINCIPAL FINDINGS: We present FastBLAST, a heuristic replacement for all-versus-all BLAST that relies on alignments of proteins to known families, obtained from tools such as PSI-BLAST and HMMer. FastBLAST avoids most of the work of all-versus-all BLAST by taking advantage of these alignments and by clustering similar sequences. FastBLAST runs in two stages: the first stage identifies additional families and aligns them, and the second stage quickly identifies the homologs of a query sequence, based on the alignments of the families, before generating pairwise alignments. On 6.53 million proteins from the non-redundant Genbank database ("NR"), FastBLAST identifies new families 25 times faster than all-versus-all BLAST. Once the first stage is completed, FastBLAST identifies homologs for the average query in less than 5 seconds (8.6 times faster than BLAST) and gives nearly identical results. For hits above 70 bits, FastBLAST identifies 98% of the top 3,250 hits per query.
CONCLUSIONS/SIGNIFICANCE: FastBLAST enables research groups that do not have supercomputers to analyze large protein sequence data sets. FastBLAST is open source software and is available at http://microbesonline.org/fastblast.
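The idea behind the second stage, narrowing a query to proteins that share a family before any pairwise alignment is computed, can be illustrated with a minimal sketch. This is not FastBLAST's implementation: the function names, the protein and family identifiers, and the use of simple shared-family membership (rather than scores derived from the family alignments) are all assumptions made for illustration.

```python
from collections import defaultdict


def build_family_index(memberships):
    """Invert a protein -> families mapping into family -> set of proteins."""
    index = defaultdict(set)
    for protein, families in memberships.items():
        for family in families:
            index[family].add(protein)
    return index


def candidate_homologs(query, memberships, index):
    """Return proteins that share at least one family with the query.

    Only these candidates would go on to pairwise alignment, instead of
    every protein in the database.
    """
    candidates = set()
    for family in memberships.get(query, ()):
        candidates |= index[family]
    candidates.discard(query)  # a protein is not its own homolog hit
    return candidates


# Hypothetical toy data: which families each protein was aligned to.
memberships = {
    "protA": {"fam1"},
    "protB": {"fam1", "fam2"},
    "protC": {"fam2"},
    "protD": {"fam3"},
    "protE": {"fam3"},
}

index = build_family_index(memberships)
for query in sorted(memberships):
    hits = sorted(candidate_homologs(query, memberships, index))
    print(query, "->", hits)
```

In this toy setting each query is compared against only the members of its own families, which is where the savings over comparing every pair come from. A real pipeline would rank candidates by their alignment to the family and then compute pairwise alignments for the best ones.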