Torkian Behzad, Hann Spencer, Preisner Eva, Norman R Sean
Department of Environmental Health Sciences, University of South Carolina, 921 Assembly Street, Columbia, SC, 29208, USA.
Environ Microbiome. 2020 Aug 12;15(1):15. doi: 10.1186/s40793-020-00361-y.
The Basic Local Alignment Search Tool (BLAST) from NCBI is the preferred utility for sequence alignment and identification in bioinformatics and genomics research. Among researchers using NCBI's BLAST software, it is well known that analyzing the results of a large BLAST search can be tedious and time-consuming. Furthermore, given recent discussion of how parameters such as '-max_target_seqs' affect the BLAST heuristic search process, the use of these search options is questionable. This leaves a stand-alone parser as one of the few options for condensing these large datasets, and with few available for download online, researchers are left to write specialized software whenever they need to analyze BLAST results. The need for a streamlined, fast script that solves these issues and can be easily integrated into a variety of bioinformatics and genomics workflows was the initial motivation for developing this software.
In this study, we demonstrate the effectiveness of BLAST-QC for analysis of BLAST results and its advantages over the other available options. Using genetic sequence data from our bioinformatic workflows, we establish BLAST-QC's superior runtime compared to existing parsers developed with the commonly used BioPerl and BioPython modules, as well as to C and Java implementations of the BLAST-QC program. We discuss the '-max_target_seqs' parameter, its usage and the controversy surrounding it, and offer a solution by demonstrating that our software provides the functionality this parameter was assumed to produce, along with a variety of other parsing options. Executions of the script on example datasets are given, demonstrating the implemented functionality and providing test cases for the program. BLAST-QC is designed to be integrated into existing software, and we establish its effectiveness as a module of workflows or other processes.
BLAST-QC provides the community with a simple, lightweight and portable Python script that allows for easy quality control of BLAST results while avoiding the drawbacks of other options. These drawbacks include the uncertain results of applying the -max_target_seqs parameter and the cumbersome dependencies of alternatives such as BioPerl and Java, which add complexity and run time when processing large sequence datasets. BLAST-QC is well suited to the high-throughput workflows and pipelines common in bioinformatic and genomic research, and the script has been designed for portability and easy integration into whatever processes the user may be running.
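As an illustration of the kind of post-search filtering described above, the following is a minimal Python sketch that keeps the best-scoring hits per query from BLAST tabular output. It is not BLAST-QC's actual code: the function names `parse_hit` and `top_hits`, the default e-value cutoff, and the choice of bit score as the ranking key are all assumptions made for this example, and the input is assumed to be the default 12-column `-outfmt 6` format.

```python
import heapq
from collections import defaultdict

# Column order of BLAST's default tabular output (-outfmt 6).
COLUMNS = ("qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore")


def parse_hit(line):
    """Turn one tab-separated line into a dict with numeric scores (hypothetical helper)."""
    hit = dict(zip(COLUMNS, line.rstrip("\n").split("\t")))
    hit["pident"] = float(hit["pident"])
    hit["evalue"] = float(hit["evalue"])
    hit["bitscore"] = float(hit["bitscore"])
    return hit


def top_hits(lines, n=1, max_evalue=1e-5):
    """Keep the n highest-bitscore hits per query that pass the e-value cutoff.

    The cutoff and ranking key are illustrative assumptions, not BLAST-QC defaults.
    """
    by_query = defaultdict(list)
    for line in lines:
        # Skip blank lines and the comment lines emitted by -outfmt 7.
        if not line.strip() or line.startswith("#"):
            continue
        hit = parse_hit(line)
        if hit["evalue"] <= max_evalue:
            by_query[hit["qseqid"]].append(hit)
    # Rank after the search has finished, over every hit that was reported.
    return {query: heapq.nlargest(n, hits, key=lambda h: h["bitscore"])
            for query, hits in by_query.items()}
```

Under these assumptions, the search itself would be run without a restrictive -max_target_seqs value, and a call such as `top_hits(open("results.tsv"), n=1)` (with a hypothetical file name) would then select the best hit per query from the complete result set.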