Mohamadi Hamid, Vandervalk Benjamin P, Raymond Anthony, Jackman Shaun D, Chu Justin, Breshears Clay P, Birol Inanc
Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC, Canada; Department of Bioinformatics, University of British Columbia, Vancouver, BC, Canada; Intel Health and Life Sciences, Intel Corporation, Hillsboro, OR, US.
Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC, Canada.
PLoS One. 2015 Apr 29;10(4):e0126409. doi: 10.1371/journal.pone.0126409. eCollection 2015.
One essential application in bioinformatics that is affected by the high-throughput sequencing data deluge is the sequence alignment problem, where nucleotide or amino acid sequences are queried against targets to find regions of close similarity. When there are too many queries and/or the targets are too large, the alignment process becomes computationally challenging. This is usually addressed by preprocessing techniques, where the queries and/or targets are indexed for easy access while searching for matches. When the target is static, such as an established reference genome, the cost of indexing is amortized by reusing the generated index. However, when the targets are non-static, such as contigs in the intermediate steps of a de novo assembly process, a new index must be computed for each run. To address such scalability problems, we present DIDA, a novel framework that distributes the indexing and alignment tasks into smaller subtasks over a cluster of compute nodes. It provides a workflow beyond the common practice of embarrassingly parallel implementations. DIDA is a cost-effective, scalable and modular framework for the sequence alignment problem in terms of memory usage and runtime. It can be employed in large-scale alignments to draft genomes and intermediate stages of de novo assembly runs. The DIDA source code, sample files and user manual are available through http://www.bcgsc.ca/platform/bioinfo/software/dida. The software is released under the British Columbia Cancer Agency License (BCCA) and is free for academic use.
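The idea of splitting indexing and alignment into subtasks can be illustrated with a minimal sketch. This is not DIDA's implementation: the round-robin partitioning, the k-mer seed index, the exact-match "alignment", and all function names (`partition_targets`, `build_index`, `align`, `distributed_align`) are assumptions made for illustration only. The partitions are processed in a plain loop here; in a cluster setting each partition would be handled by a separate compute node.

```python
from collections import defaultdict


def partition_targets(targets, n_parts):
    """Split target sequences into n_parts groups (round-robin; an assumed scheme)."""
    parts = [dict() for _ in range(n_parts)]
    for i, (name, seq) in enumerate(sorted(targets.items())):
        parts[i % n_parts][name] = seq
    return parts


def build_index(part, k):
    """Index one partition: map each k-mer to the (target, offset) pairs where it occurs."""
    index = defaultdict(list)
    for name, seq in part.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].append((name, i))
    return index


def align(query, part, index, k):
    """Toy exact-match alignment: seed on the query's first k-mer, then verify the full query."""
    hits = []
    for name, pos in index.get(query[:k], []):
        if part[name][pos:pos + len(query)] == query:
            hits.append((name, pos))
    return hits


def distributed_align(targets, queries, n_parts=2, k=4):
    """Partition targets, index each partition, align every query per partition, merge hits."""
    parts = partition_targets(targets, n_parts)
    # Each index covers only its own partition, so it is smaller than a whole-target index.
    indexes = [build_index(part, k) for part in parts]
    merged = {}
    for qname, query in queries.items():
        hits = []
        for part, index in zip(parts, indexes):
            hits.extend(align(query, part, index, k))
        merged[qname] = sorted(hits)
    return merged


if __name__ == "__main__":
    targets = {"t1": "ACGTACGTGG", "t2": "TTTTACGTGGA"}
    queries = {"q1": "ACGTGG", "q2": "GGGGGG"}
    print(distributed_align(targets, queries))
```

Because every partition carries its own index, the memory needed per node depends on the partition size rather than on the full target set, which is the property that matters when targets change between runs and the index cannot be reused.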