Department of Energy, Joint Genome Institute, Walnut Creek, CA 94598, USA and Genomics Division, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA.
Bioinformatics. 2013 Dec 1;29(23):3014-9. doi: 10.1093/bioinformatics/btt528. Epub 2013 Sep 10.
The recent revolution in sequencing technologies has led to exponential growth in sequence data. As a result, most current bioinformatics tools are becoming obsolete because they fail to scale with the data. To tackle this 'data deluge', we introduce the BioPig sequence analysis toolkit as one solution that scales with both data and computation.
We built BioPig on Apache's Hadoop MapReduce system and the Pig data flow language. Compared with traditional serial and MPI-based algorithms, BioPig has three major advantages: first, BioPig's programmability greatly reduces development time for parallel bioinformatics applications; second, tests of BioPig on up to 500 Gb of sequence data demonstrate that it scales automatically with data size; and finally, BioPig can be ported without modification to many Hadoop infrastructures, as tested on the Magellan system at the National Energy Research Scientific Computing Center and on the Amazon Elastic Compute Cloud. In summary, BioPig represents a novel programming framework with the potential to greatly accelerate data-intensive bioinformatics analysis.
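To make the MapReduce model underlying this design concrete, the following is a minimal single-process sketch in Python, not BioPig code and not Pig syntax. It assumes k-mer counting as the example task (the abstract does not name a specific analysis), and the function names `map_kmers`, `reduce_counts`, and `count_kmers` are hypothetical. On a Hadoop cluster the map calls would run in parallel over partitions of the reads and the framework would group the emitted pairs by key before the reduce step.

```python
from collections import Counter
from itertools import chain


def map_kmers(read, k):
    """Map step: emit a (k-mer, 1) pair for every length-k window in one read."""
    return [(read[i:i + k], 1) for i in range(len(read) - k + 1)]


def reduce_counts(pairs):
    """Reduce step: sum the counts emitted for each distinct k-mer."""
    totals = Counter()
    for kmer, count in pairs:
        totals[kmer] += count
    return dict(totals)


def count_kmers(reads, k):
    """Run the map step on every read, then reduce all emitted pairs.

    Here both steps run serially in one process; this only illustrates
    the data flow that a MapReduce system would distribute.
    """
    mapped = chain.from_iterable(map_kmers(read, k) for read in reads)
    return reduce_counts(mapped)


# Toy input; real inputs would be sequence files stored on a distributed file system.
reads = ["ACGTAC", "GTACGT"]
kmer_counts = count_kmers(reads, 3)
```

Because each read is mapped independently and the reduce step only sums per-key counts, the same logic can in principle be spread across more machines as the input grows, which is the property the scaling claim above relies on.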