Moving Just Enough Deep Sequencing Data to Get the Job Done.

Authors

Mills Nicholas, Bensman Ethan M, Poehlman William L, Ligon Walter B, Feltus F Alex

Affiliations

Holcombe Department of Electrical and Computer Engineering, Clemson University, Clemson, SC, USA.

School of Computing, Clemson University, Clemson, SC, USA.

Publication information

Bioinform Biol Insights. 2019 Jun 14;13:1177932219856359. doi: 10.1177/1177932219856359. eCollection 2019.

DOI:10.1177/1177932219856359
PMID:31236009
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC6572328/
Abstract

MOTIVATION

As the size of high-throughput DNA sequence datasets continues to grow, the cost of transferring and storing the datasets may prevent their processing in all but the largest data centers or commercial cloud providers. To lower this cost, it should be possible to process only a subset of the original data while still preserving the biological information of interest.

RESULTS

Using 4 high-throughput DNA sequence datasets of differing sequencing depth from 2 species as use cases, we demonstrate the effect of processing partial datasets on the number of detected RNA transcripts using an RNA-Seq workflow. We used transcript detection to decide on a cutoff point. We then physically transferred the minimal partial dataset and compared with the transfer of the full dataset, which showed a reduction of approximately 25% in the total transfer time. These results suggest that as sequencing datasets get larger, one way to speed up analysis is to simply transfer the minimal amount of data that still sufficiently detects biological signal.
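
The cutoff idea in the results can be illustrated with a minimal sketch. The abstract does not state the exact rule the authors used, so the criterion here is an assumption: pick the smallest fraction of the dataset whose detected transcript count reaches a chosen share (`keep`) of the count obtained from the full dataset. The function name `minimal_fraction`, the `keep` parameter, and all numbers are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def minimal_fraction(
    fractions: Sequence[float],
    detected: Sequence[int],
    keep: float = 0.95,
) -> float:
    """Return the smallest data fraction whose detected transcript count
    is at least `keep` times the count seen with the largest fraction.

    Assumed criterion for illustration; not the authors' stated method.
    """
    if not fractions or len(fractions) != len(detected):
        raise ValueError("fractions and detected must be non-empty and equal in length")

    # Sort by fraction so the last entry is the most complete dataset.
    pairs = sorted(zip(fractions, detected))
    full_count = pairs[-1][1]
    target = keep * full_count

    for fraction, count in pairs:
        if count >= target:
            return fraction
    return pairs[-1][0]


# Hypothetical transcript counts for illustration only.
fractions = [0.2, 0.4, 0.6, 0.8, 1.0]
detected = [5000, 8000, 9600, 9900, 10000]
cutoff = minimal_fraction(fractions, detected, keep=0.95)
print(f"Transfer only the first {cutoff:.0%} of the reads")
```

In this made-up example, 60% of the reads already recovers at least 95% of the transcripts found with the full dataset, so only that portion would need to be moved. How much transfer time this saves depends on the network and storage setup, which the sketch does not model.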

AVAILABILITY

All results were generated using public datasets from NCBI and publicly available open source software.
