College of Computer, National University of Defense Technology, No.109, Deya Road, Kaifu District, Changsha, 410073, China.
Institute of Computing Technology, Chinese Academy of Sciences, No.6, South Road of the Academy of Sciences, Haidian District, Beijing, 100190, China.
Gigascience. 2018 Aug 1;7(8):giy098. doi: 10.1093/gigascience/giy098.
With the rapid development of next-generation sequencing technology, ever-increasing quantities of genomic data pose a tremendous challenge to data processing. Therefore, there is an urgent need for highly scalable and powerful computational systems. Among the state-of-the-art parallel computing platforms, Apache Spark is a fast, general-purpose, in-memory, iterative computing framework for large-scale data processing that ensures high fault tolerance and high scalability by introducing the resilient distributed dataset abstraction. In terms of performance, Spark can be up to 100 times faster in terms of memory access and 10 times faster in terms of disk access than Hadoop. Moreover, it provides advanced application programming interfaces in Java, Scala, Python, and R. It also supports some advanced components, including Spark SQL for structured data processing, MLlib for machine learning, GraphX for computing graphs, and Spark Streaming for stream computing. We surveyed Spark-based applications used in next-generation sequencing and other biological domains, such as epigenetics, phylogeny, and drug discovery. The results of this survey are used to provide a comprehensive guideline allowing bioinformatics researchers to apply Spark in their own fields.
随着下一代测序技术的快速发展,越来越多的基因组数据对数据处理提出了巨大的挑战。因此,我们迫切需要高度可扩展和强大的计算系统。在最先进的并行计算平台中,Apache Spark 是一个快速、通用、基于内存、迭代计算框架,用于大规模数据处理,通过引入弹性分布式数据集抽象,确保了高容错性和高可扩展性。在性能方面,Spark 在内存访问方面的速度可以比 Hadoop 快 100 倍,在磁盘访问方面的速度可以快 10 倍。此外,它还提供了 Java、Scala、Python 和 R 中的高级应用程序编程接口。它还支持一些高级组件,包括用于结构化数据处理的 Spark SQL、用于机器学习的 MLlib、用于计算图的 GraphX 和用于流计算的 Spark Streaming。我们调查了基于 Spark 的在下一代测序和其他生物领域(如表观遗传学、系统发育学和药物发现)中的应用。该调查的结果用于提供一个全面的指南,允许生物信息学研究人员在自己的领域中应用 Spark。