Aalto University, Department of Information and Computer Science, Aalto, Finland.
Bioinformatics. 2012 Mar 15;28(6):876-7. doi: 10.1093/bioinformatics/bts054. Epub 2012 Feb 2.
Hadoop-BAM is a novel library for the scalable manipulation of aligned next-generation sequencing data in the Hadoop distributed computing framework. It acts as an integration layer between analysis applications and BAM files that are processed using Hadoop. Hadoop-BAM solves the issues related to BAM data access by presenting a convenient API for implementing map and reduce functions that can operate directly on BAM records. It builds on top of the Picard SAM JDK, so tools that rely on the Picard API are expected to be easily convertible to support large-scale distributed processing. In this article, we demonstrate the use of Hadoop-BAM by building a coverage summarizing tool for the Chipster genome browser. Our results show that Hadoop offers good scalability and that one should avoid moving data in and out of Hadoop between analysis steps.
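
To make the map and reduce roles concrete, here is a minimal sketch of a binned coverage summary in plain Python. It does not use Hadoop, Hadoop-BAM, or the Picard API, and it is not the method of the Chipster tool, which the abstract does not describe. The `AlignedRead` type, the `BIN_SIZE` value, and the function names are all hypothetical, and an in-memory dictionary stands in for Hadoop's shuffle phase.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, NamedTuple, Tuple


class AlignedRead(NamedTuple):
    """Hypothetical stand-in for one aligned BAM record (not Picard's record class)."""
    reference: str  # reference sequence name, e.g. "chr1"
    start: int      # 1-based leftmost aligned position
    length: int     # number of reference bases covered by the alignment


BIN_SIZE = 100  # assumed summary resolution, in bases

BinKey = Tuple[str, int]  # (reference name, 0-based bin index)


def map_read(read: AlignedRead) -> Iterator[Tuple[BinKey, int]]:
    """Map step: emit (bin key, covered bases) for every bin the read overlaps."""
    end = read.start + read.length - 1          # inclusive, 1-based
    first_bin = (read.start - 1) // BIN_SIZE
    last_bin = (end - 1) // BIN_SIZE
    for b in range(first_bin, last_bin + 1):
        bin_start = b * BIN_SIZE + 1
        bin_end = (b + 1) * BIN_SIZE
        overlap = min(end, bin_end) - max(read.start, bin_start) + 1
        yield (read.reference, b), overlap


def reduce_bin(values: Iterable[int]) -> float:
    """Reduce step: mean coverage depth over one bin."""
    return sum(values) / BIN_SIZE


def summarize(reads: Iterable[AlignedRead]) -> Dict[BinKey, float]:
    """Run map, group by key (the stand-in for shuffle), then reduce."""
    grouped: Dict[BinKey, List[int]] = defaultdict(list)
    for read in reads:
        for key, value in map_read(read):
            grouped[key].append(value)
    return {key: reduce_bin(vals) for key, vals in grouped.items()}


if __name__ == "__main__":
    demo = [AlignedRead("chr1", 1, 100), AlignedRead("chr1", 51, 100)]
    for key, depth in sorted(summarize(demo).items()):
        print(key, depth)
```

Because each read is mapped independently and each bin is reduced independently, this kind of summary suits the distributed map and reduce model that the abstract describes.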