Analyzing big datasets of genomic sequences: fast and scalable collection of k-mer statistics.

Affiliations

Dipartimento di Scienze Statistiche, Università di Roma - La Sapienza, Rome, 00185, Italy.

Dipartimento di Ingegneria Informatica, Automatica e Gestionale, Università di Roma - La Sapienza, Rome, 00185, Italy.

Publication information

BMC Bioinformatics. 2019 Apr 18;20(Suppl 4):138. doi: 10.1186/s12859-019-2694-8.

Abstract

BACKGROUND

Distributed approaches based on the MapReduce programming paradigm have begun to appear in bioinformatics, driven by the large amount of data produced by next-generation sequencing techniques. However, the use of MapReduce and related Big Data technologies and frameworks (e.g., Apache Hadoop and Spark) does not necessarily produce satisfactory results, in terms of either efficiency or effectiveness. We discuss how the development of distributed and Big Data management technologies has affected the analysis of large datasets of biological sequences. Moreover, we show that the choice of parameter configuration and the careful engineering of the software for the specific framework under consideration may be crucial to achieving good performance, especially on very large amounts of data. We choose k-mer counting as a case study for our analysis, and Spark as the framework to implement FastKmer, a novel approach for the extraction of k-mer statistics from large collections of biological sequences, with arbitrary values of k.
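To make the case study concrete, the following is a minimal single-machine sketch of k-mer counting written in a map/reduce style. It is not the FastKmer implementation: the function names `map_kmers` and `count_kmers` and the toy input sequences are assumptions made for illustration, and plain Python stands in for a distributed framework such as Spark.

```python
from collections import Counter
from functools import reduce


def map_kmers(sequence, k):
    """Map step (hypothetical helper): count the k-mers of one sequence."""
    seq = sequence.upper()
    # Slide a window of length k over the sequence; a sequence shorter
    # than k yields an empty range and therefore an empty Counter.
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def count_kmers(sequences, k):
    """Reduce step (hypothetical helper): merge per-sequence counts."""
    partial_counts = (map_kmers(s, k) for s in sequences)
    return reduce(lambda a, b: a + b, partial_counts, Counter())


# Toy input, assumed for illustration only.
counts = count_kmers(["ACGTAC", "GTACGT"], k=3)
```

In a distributed setting the map step would run on separate partitions of the input and the reduce step would aggregate counts across nodes, which is where the choice of framework and its configuration starts to matter.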

RESULTS

One of the most relevant contributions of FastKmer is the introduction of a module for balancing the statistics aggregation workload over the nodes of a computing cluster, in order to overcome data skew while allowing for full exploitation of the underlying distributed architecture. We also present the results of a comparative experimental analysis showing that our approach is currently the fastest among those based on Big Data technologies, while exhibiting very good scalability.
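The abstract does not describe how the balancing module works, so the sketch below only illustrates the general idea of countering data skew. It uses a generic greedy heuristic (heaviest bin first, assigned to the currently lightest partition), which is an assumption and not necessarily what FastKmer does. The names `balance_bins` and `partition_totals` and the bin loads are hypothetical.

```python
import heapq


def balance_bins(bin_loads, num_partitions):
    """Assign each bin to a partition, heaviest bin first, always picking
    the partition with the smallest accumulated load (greedy heuristic)."""
    # Min-heap of (accumulated_load, partition_id).
    heap = [(0, p) for p in range(num_partitions)]
    heapq.heapify(heap)
    assignment = {}
    heaviest_first = sorted(bin_loads.items(), key=lambda kv: kv[1], reverse=True)
    for bin_id, load in heaviest_first:
        total, partition = heapq.heappop(heap)
        assignment[bin_id] = partition
        heapq.heappush(heap, (total + load, partition))
    return assignment


def partition_totals(bin_loads, assignment, num_partitions):
    """Sum the load that ends up on each partition."""
    totals = [0] * num_partitions
    for bin_id, partition in assignment.items():
        totals[partition] += bin_loads[bin_id]
    return totals


# Hypothetical, deliberately skewed bin sizes (e.g. number of k-mers per bin).
loads = {"bin0": 90, "bin1": 40, "bin2": 30, "bin3": 20, "bin4": 10, "bin5": 10}
assignment = balance_bins(loads, 2)
totals = partition_totals(loads, assignment, 2)
```

With these toy loads the two partitions each receive a total load of 100, whereas a naive split by bin order (the first three bins versus the last three) would give 160 against 40.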

CONCLUSIONS

We provide evidence that the use of technologies such as Hadoop or Spark for the analysis of big datasets of biological sequences is productive only if the architectural details and the peculiar aspects of the considered framework are carefully taken into account in the algorithm design and implementation.
