
A Large-Scale and Serverless Computational Approach for Improving Quality of NGS Data Supporting Big Multi-Omics Data Analyses.

Authors

Mrozek Dariusz, Stępień Krzysztof, Grzesik Piotr, Małysiak-Mrozek Bożena

Affiliations

Department of Applied Informatics, Silesian University of Technology, Gliwice, Poland.

Department of Graphics, Computer Vision and Digital Systems, Silesian University of Technology, Gliwice, Poland.

Publication information

Front Genet. 2021 Jul 13;12:699280. doi: 10.3389/fgene.2021.699280. eCollection 2021.

DOI: 10.3389/fgene.2021.699280
PMID: 34326863
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC8314304/
Abstract

Many analyses of multi-omics data are driven today by next-generation sequencing (NGS) techniques, which produce large volumes of DNA/RNA sequences. Although many tools allow parallel processing of NGS data in a distributed Big Data environment, they do not make it easy to improve the quality of NGS data at large scale in a simple, declarative manner. Meanwhile, large sequencing projects, and the routine DNA/RNA sequencing that accompanies molecular profiling of diseases for personalized treatment, require both good-quality data and an appropriate infrastructure for storing and processing the data efficiently. To address these problems, we adapt the concept of a Data Lake for storing and processing big NGS data. We also propose a dedicated library for cleaning DNA/RNA sequences obtained with single-read and paired-end sequencing techniques. To accommodate the growth of NGS data, our solution is highly scalable in the Cloud and can adjust rapidly and flexibly to the amount of data to be processed. Moreover, to simplify the use of the data cleaning methods and the implementation of other phases of data analysis workflows, our library extends the declarative U-SQL query language with a set of capabilities for extracting, processing, and storing data. The results of our experiments show that the whole solution meets the requirements for ample storage and highly parallel, scalable processing that accompany NGS-based multi-omics data analyses.
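
As a rough illustration of what "cleaning" sequencing reads can involve, the sketch below trims low-quality bases from the 3' end of single-end FASTQ reads and drops reads that become too short. It is a minimal Python sketch under stated assumptions, not the authors' U-SQL library: the function names (`parse_fastq`, `trim_tail`, `clean_reads`), the Phred+33 quality encoding, and the thresholds `min_q=20` and `min_len=4` are all hypothetical choices made for this example.

```python
from typing import Iterable, Iterator, List, Tuple

# A read is (header, sequence, quality string); names here are illustrative only.
Read = Tuple[str, str, str]


def parse_fastq(lines: List[str]) -> Iterator[Read]:
    """Yield (header, sequence, quality) from consecutive 4-line FASTQ records."""
    for i in range(0, len(lines) - 3, 4):
        yield lines[i].strip(), lines[i + 1].strip(), lines[i + 3].strip()


def trim_tail(seq: str, qual: str, min_q: int = 20, offset: int = 33) -> Tuple[str, str]:
    """Remove trailing bases whose Phred score (assumed Phred+33) is below min_q."""
    end = len(seq)
    while end > 0 and ord(qual[end - 1]) - offset < min_q:
        end -= 1
    return seq[:end], qual[:end]


def clean_reads(reads: Iterable[Read], min_q: int = 20, min_len: int = 4) -> Iterator[Read]:
    """Trim each read and keep it only if at least min_len bases remain."""
    for header, seq, qual in reads:
        seq, qual = trim_tail(seq, qual, min_q)
        if len(seq) >= min_len:
            yield header, seq, qual


# Two toy records: 'I' encodes Phred 40, '#' encodes Phred 2.
records = [
    "@r1", "ACGTACGT", "+", "IIIIII##",
    "@r2", "ACGT", "+", "####",
]
cleaned = list(clean_reads(parse_fastq(records)))
```

For paired-end data, a cleaning step of this kind would typically also need to keep the two mates in sync, for example by discarding both reads of a pair when either one fails the length filter; that logic is omitted here.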

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/a272/8314304/42a2059a4fb1/fgene-12-699280-g0008.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/a272/8314304/43d04577c434/fgene-12-699280-g0001.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/a272/8314304/e5415300c14d/fgene-12-699280-g0002.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/a272/8314304/3860e4117bd9/fgene-12-699280-g0003.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/a272/8314304/c019d94b3515/fgene-12-699280-g0004.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/a272/8314304/b088531dbd80/fgene-12-699280-g0005.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/a272/8314304/bbfacde66127/fgene-12-699280-g0006.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/a272/8314304/0fc2c02639c6/fgene-12-699280-g0007.jpg
