Xia Xuhua
Department of Biology, University of Ottawa, Ontario K1N 6N5, Canada
Ottawa Institute of Systems Biology, Ontario K1H 8M5, Canada
G3 (Bethesda). 2017 Dec 4;7(12):3839-3848. doi: 10.1534/g3.117.300271.
Two major stumbling blocks exist in high-throughput sequencing (HTS) data analysis. The first is the sheer file size, typically in gigabytes when uncompressed, which causes problems in storage, transmission, and analysis. However, these files do not need to be so large, and can be reduced without loss of information. Each HTS file, whether in compressed .SRA or plain-text .fastq format, contains numerous identical reads stored as separate entries. For example, among the 44,603,541 forward reads in the SRR4011234.sra file (from a transcriptomic study) deposited in NCBI's SRA database, one read has 497,027 identical copies. Instead of storing them as separate entries, one can and should store them as a single entry in the SeqID_NumCopy format (which I dub the FASTA+ format). The second is the proper allocation of reads that map equally well to paralogous genes. I illustrate in detail a new method for such allocation. I have developed the ARSDA software, which implements these new approaches. A number of HTS files for model species are being processed and deposited at http://coevol.rdc.uottawa.ca to demonstrate that this approach not only saves a huge amount of storage space and transmission bandwidth, but also dramatically reduces the time needed for downstream data analysis. Instead of matching the 497,027 identical reads separately against the genome, one needs to match the read only once. ARSDA includes functions that take advantage of HTS data in the new sequence format for downstream data analysis such as gene expression characterization. I contrasted gene expression results between ARSDA and Cufflinks so that readers can better appreciate the strength of ARSDA. ARSDA is freely available for Windows, Linux, and Macintosh computers at http://dambe.bio.uottawa.ca/ARSDA/ARSDA.aspx.
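The idea of storing identical reads once, with the copy number carried in the sequence ID, can be illustrated with a minimal Python sketch. This is not ARSDA's implementation. The function names, the `UniqueSeq` ID prefix, and the choice to order entries by copy number are assumptions made for illustration; only the SeqID_NumCopy header layout comes from the abstract. The sketch also assumes plain four-line FASTQ records and keeps only the sequence line of each record.

```python
from collections import Counter
from typing import Iterable, Iterator, List, Tuple


def read_fastq_sequences(lines: Iterable[str]) -> Iterator[str]:
    """Yield the sequence line (2nd of every 4 lines) of each FASTQ record.

    Assumes strict 4-line records: header, sequence, '+', quality.
    """
    for i, line in enumerate(lines):
        if i % 4 == 1:
            yield line.strip()


def collapse_reads(sequences: Iterable[str]) -> List[Tuple[str, int]]:
    """Count identical reads; return (sequence, copies), most frequent first."""
    return Counter(sequences).most_common()


def to_fasta_plus(collapsed: List[Tuple[str, int]],
                  prefix: str = "UniqueSeq") -> str:
    """Format collapsed reads with SeqID_NumCopy headers.

    The ID prefix is a hypothetical naming choice for this sketch.
    """
    out = []
    for idx, (seq, copies) in enumerate(collapsed, start=1):
        out.append(f">{prefix}{idx}_{copies}")
        out.append(seq)
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    # Toy input: four reads, three of which are identical.
    toy_fastq = [
        "@r1", "ACGT", "+", "IIII",
        "@r2", "ACGT", "+", "IIII",
        "@r3", "TTGA", "+", "IIII",
        "@r4", "ACGT", "+", "IIII",
    ]
    collapsed = collapse_reads(read_fastq_sequences(toy_fastq))
    print(to_fasta_plus(collapsed), end="")
```

On the toy input, the three copies of `ACGT` become one entry whose header ends in `_3`, so a downstream mapper would need to align that sequence once and weight the hit by the copy number.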