Mrozek Dariusz, Stępień Krzysztof, Grzesik Piotr, Małysiak-Mrozek Bożena
Department of Applied Informatics, Silesian University of Technology, Gliwice, Poland.
Department of Graphics, Computer Vision and Digital Systems, Silesian University of Technology, Gliwice, Poland.
Front Genet. 2021 Jul 13;12:699280. doi: 10.3389/fgene.2021.699280. eCollection 2021.
Many analyses of multi-omics data are now driven by next-generation sequencing (NGS) techniques, which produce large volumes of DNA/RNA sequences. Although many tools allow parallel processing of NGS data in a distributed Big Data environment, they do not make it easy to improve the quality of NGS data at scale in a simple, declarative manner. Meanwhile, large sequencing projects and the routine DNA/RNA sequencing used in molecular profiling of diseases for personalized treatment require both good-quality data and appropriate infrastructure for storing and processing the data efficiently. To address these problems, we adapt the Data Lake concept to storing and processing big NGS data. We also propose a dedicated library for cleaning DNA/RNA sequences obtained with single-read and paired-end sequencing techniques. To accommodate the growth of NGS data, our solution is highly scalable in the Cloud and can adjust rapidly and flexibly to the amount of data to be processed. Moreover, to simplify the use of the data cleaning methods and the implementation of other phases of data analysis workflows, our library extends the declarative U-SQL query language with a set of capabilities for data extraction, processing, and storage. The results of our experiments show that the whole solution meets the requirements for ample storage and highly parallel, scalable processing that accompany NGS-based multi-omics data analyses.
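To make "cleaning" concrete, the following is a minimal Python sketch of one common cleaning step for single-read data: trimming low-quality bases from the 3' end of each read and discarding reads that become too short. It is an illustration only and not the authors' library, which extends U-SQL. The names `FastqRead`, `parse_fastq`, `trim_3prime`, and `clean_reads`, the Phred+33 quality encoding, and the thresholds are all assumptions made for this example. Paired-end handling, where both mates must be kept or dropped together, is not covered.

```python
from typing import Iterable, Iterator, NamedTuple


class FastqRead(NamedTuple):
    """One FASTQ record (hypothetical helper type for this sketch)."""
    name: str
    sequence: str
    quality: str  # assumed to be Phred+33 encoded


def parse_fastq(lines: Iterable[str]) -> Iterator[FastqRead]:
    """Parse well-formed four-line FASTQ records from an iterable of lines."""
    it = iter(lines)
    for header in it:
        header = header.strip()
        if not header:
            continue  # skip blank lines between records
        sequence = next(it).strip()
        next(it)  # the '+' separator line is ignored
        quality = next(it).strip()
        yield FastqRead(header[1:], sequence, quality)


def trim_3prime(read: FastqRead, min_quality: int = 20, offset: int = 33) -> FastqRead:
    """Remove trailing bases whose Phred score is below min_quality."""
    end = len(read.sequence)
    while end > 0 and ord(read.quality[end - 1]) - offset < min_quality:
        end -= 1
    return FastqRead(read.name, read.sequence[:end], read.quality[:end])


def clean_reads(
    reads: Iterable[FastqRead], min_quality: int = 20, min_length: int = 4
) -> Iterator[FastqRead]:
    """Trim every read and keep only those still at least min_length long."""
    for read in reads:
        trimmed = trim_3prime(read, min_quality)
        if len(trimmed.sequence) >= min_length:
            yield trimmed


# Toy input: 'I' encodes Phred 40 and '#' encodes Phred 2 under Phred+33.
records = [
    "@r1", "ACGTACGT", "+", "IIIIII##",
    "@r2", "ACGT", "+", "####",
]
cleaned = list(clean_reads(parse_fastq(records)))
for read in cleaned:
    print(read.name, read.sequence, read.quality)
```

In this toy input, the last two bases of `r1` fall below the threshold and are trimmed, while `r2` is trimmed to an empty read and dropped. Because each read is processed independently, a step like this parallelizes naturally across partitions of a large dataset, which is the property a distributed cleaning library relies on.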