ARSDA: A New Approach for Storing, Transmitting and Analyzing Transcriptomic Data.

Author information

Xia Xuhua

Affiliations

Department of Biology, University of Ottawa, Ontario K1N 6N5, Canada

Ottawa Institute of Systems Biology, Ontario K1H 8M5, Canada

Publication information

G3 (Bethesda). 2017 Dec 4;7(12):3839-3848. doi: 10.1534/g3.117.300271.

DOI:10.1534/g3.117.300271

PMID:29079682

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC5714481/

Abstract

Two major stumbling blocks exist in high-throughput sequencing (HTS) data analysis. The first is the sheer file size, typically in gigabytes when uncompressed, causing problems in storage, transmission, and analysis. However, these files do not need to be so large, and can be reduced without loss of information. Each HTS file, whether in compressed .SRA or plain-text .fastq format, contains numerous identical reads stored as separate entries. For example, among 44,603,541 forward reads in the SRR4011234.sra file (from a transcriptomic study) deposited at NCBI's SRA database, one read has 497,027 identical copies. Instead of storing them as separate entries, one can and should store them as a single entry with the SeqID_NumCopy format (which I dub the FASTA+ format). The second is the proper allocation of reads that map equally well to paralogous genes. I illustrate in detail a new method for such allocation. I have developed ARSDA software that implements these new approaches. A number of HTS files for model species are being processed and deposited at http://coevol.rdc.uottawa.ca to demonstrate that this approach not only saves a huge amount of storage space and transmission bandwidth, but also dramatically reduces time in downstream data analysis. Instead of matching the 497,027 identical reads separately against the genome, one needs to match the read only once. ARSDA includes functions to take advantage of HTS data in the new sequence format for downstream data analysis such as gene expression characterization. I contrasted gene expression results between ARSDA and Cufflinks so readers can better appreciate the strength of ARSDA. ARSDA is freely available for Windows, Linux, and Macintosh computers at http://dambe.bio.uottawa.ca/ARSDA/ARSDA.aspx.
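The idea of storing identical reads as a single SeqID_NumCopy entry can be sketched in a few lines of Python. This is only an illustration of the collapsing step, not ARSDA's implementation: the function names, the `UniqueSeq` ID prefix, and the ordering by copy number are assumptions made for the example, and the sketch keeps only the sequences and their copy numbers (FASTQ quality lines and original read identifiers are not retained).

```python
from collections import Counter
from typing import Iterable, Iterator, List, Tuple


def read_fastq_sequences(lines: Iterable[str]) -> Iterator[str]:
    """Yield the sequence line (2nd of every 4 lines) of each FASTQ record."""
    for i, line in enumerate(lines):
        if i % 4 == 1:
            yield line.strip()


def collapse_reads(sequences: Iterable[str]) -> List[Tuple[str, str]]:
    """Collapse identical reads into (SeqID_NumCopy, sequence) pairs.

    The 'UniqueSeq' prefix is a hypothetical naming choice; the abstract
    specifies only the general SeqID_NumCopy pattern.
    """
    counts = Counter(sequences)
    records = []
    # most_common() orders by copy number, highest first
    for serial, (seq, n_copies) in enumerate(counts.most_common(), start=1):
        records.append((f"UniqueSeq{serial}_{n_copies}", seq))
    return records


def to_fasta_plus(records: List[Tuple[str, str]]) -> str:
    """Render collapsed records as FASTA-style text, one entry per unique read."""
    return "".join(f">{seq_id}\n{seq}\n" for seq_id, seq in records)


def copy_number(seq_id: str) -> int:
    """Recover the copy number from the suffix of a SeqID_NumCopy identifier."""
    return int(seq_id.rsplit("_", 1)[1])


if __name__ == "__main__":
    fastq_text = (
        "@r1\nACGT\n+\nIIII\n"
        "@r2\nACGT\n+\nIIII\n"
        "@r3\nTTGA\n+\nIIII\n"
    )
    collapsed = collapse_reads(read_fastq_sequences(fastq_text.splitlines()))
    print(to_fasta_plus(collapsed), end="")
```

Because each unique read carries its copy number in its identifier, a downstream step can match the read against the genome once and weight the hit by `copy_number(seq_id)`, which is the time saving the abstract describes.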
