Department of Computer Science, 387 Soda Hall, UC Berkeley, Berkeley, CA 94720, USA.
BMC Bioinformatics. 2013 Dec 7;14:358. doi: 10.1186/1471-2105-14-358.
Probabilistic assignment of ambiguously mapped fragments produced by high-throughput sequencing experiments has been demonstrated to greatly improve accuracy in the analysis of RNA-Seq and ChIP-Seq, and is an essential step in many other sequence census experiments. A maximum likelihood method using the expectation-maximization (EM) algorithm for optimization is commonly used to solve this problem. However, batch EM-based approaches do not scale well with the size of sequencing datasets, which have been increasing dramatically over the past few years. Thus, current approaches to fragment assignment rely on heuristics or approximations for tractability.
We present an implementation of a distributed EM solution to the fragment assignment problem using Spark, a data analytics framework that can scale by leveraging compute clusters within datacenters ("the cloud"). We demonstrate that our implementation easily scales to billions of sequenced fragments, while providing the exact maximum likelihood assignment of ambiguous fragments. The method is shown to be more accurate than the most widely used tools available, and it runs in a constant amount of time when cluster resources are scaled linearly with the amount of input data.
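To illustrate why EM for fragment assignment lends itself to distribution, the following is a minimal sketch in plain Python, not the eXpress-D implementation. It assumes a deliberately simplified model in which a fragment is equally likely to arise from any target it maps to, so only the target abundances `theta` are estimated. The names `e_step`, `merge_counts`, and `em` are hypothetical. The E-step is an independent map over fragments and the M-step is a sum over the mapped results, which is the map/reduce shape that a framework such as Spark could parallelize across a cluster.

```python
from collections import defaultdict
from functools import reduce


def e_step(fragment, theta):
    """Map: split one fragment across its compatible targets in proportion to theta.

    `fragment` is a non-empty tuple of target names the fragment maps to.
    Simplifying assumption: every compatible target is equally likely to
    produce the fragment, so only the abundances theta matter.
    """
    total = sum(theta[t] for t in fragment)
    return {t: theta[t] / total for t in fragment}


def merge_counts(a, b):
    """Reduce: add two dictionaries of expected fragment counts."""
    out = defaultdict(float, a)
    for target, count in b.items():
        out[target] += count
    return out


def em(fragments, targets, n_iter=100):
    """Run a fixed number of batch EM iterations and return the abundances."""
    theta = {t: 1.0 / len(targets) for t in targets}  # uniform start
    for _ in range(n_iter):
        # E-step: each fragment is processed independently (a map).
        partial = map(lambda f: e_step(f, theta), fragments)
        # M-step: sum the expected counts (a reduce), then normalize.
        counts = reduce(merge_counts, partial, defaultdict(float))
        theta = {t: counts.get(t, 0.0) / len(fragments) for t in targets}
    return theta


# Toy data: two fragments unique to A, one unique to B, one ambiguous.
fragments = [("A",), ("A",), ("B",), ("A", "B")]
theta = em(fragments, targets=["A", "B"])
```

For this toy input the likelihood is proportional to theta_A^2 * theta_B, which is maximized at theta_A = 2/3, and the iteration converges to that value. A realistic model would also weight each alignment by target length, fragment length, and error and bias terms, all of which are omitted here.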
The cloud offers one solution for the difficulties faced in the analysis of massive high-throughput sequencing data, which continue to grow rapidly. Researchers in bioinformatics must follow developments in distributed systems, such as new frameworks like Spark, for ways to port existing methods to the cloud and help them scale to the datasets of the future. Our software, eXpress-D, is freely available at: http://github.com/adarob/express-d.