Mills Nicholas, Bensman Ethan M, Poehlman William L, Ligon Walter B, Feltus F Alex
Holcombe Department of Electrical and Computer Engineering, Clemson University, Clemson, SC, USA.
School of Computing, Clemson University, Clemson, SC, USA.
Bioinform Biol Insights. 2019 Jun 14;13:1177932219856359. doi: 10.1177/1177932219856359. eCollection 2019.
As the size of high-throughput DNA sequence datasets continues to grow, the cost of transferring and storing the datasets may prevent their processing in all but the largest data centers or commercial cloud providers. To lower this cost, it should be possible to process only a subset of the original data while still preserving the biological information of interest.
Using 4 high-throughput DNA sequence datasets of differing sequencing depth from 2 species as use cases, we demonstrate the effect of processing partial datasets on the number of RNA transcripts detected by an RNA-Seq workflow. We used transcript detection to decide on a cutoff point. We then physically transferred the minimal partial dataset and compared its transfer time with that of the full dataset, observing a reduction of approximately 25% in total transfer time. These results suggest that as sequencing datasets get larger, one way to speed up analysis is simply to transfer the minimal amount of data that still suffices to detect the biological signal.
All results were generated using public datasets from NCBI and publicly available open source software.
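The cutoff idea described above can be illustrated with a minimal sketch. This is not the authors' procedure: the function name `minimal_fraction`, the `tolerance` parameter, and the rule "smallest subset that detects at least 95% of the transcripts found in the full dataset" are assumptions made here for illustration, and the counts in the usage example are invented rather than taken from the study.

```python
def minimal_fraction(detections, tolerance=0.05):
    """Return the smallest dataset fraction whose transcript count is
    within `tolerance` of the count obtained from the full dataset.

    detections: iterable of (fraction, n_transcripts) pairs; the pair
    with the largest fraction is treated as the full dataset.
    The selection rule is an illustrative assumption, not the paper's.
    """
    points = sorted(detections)
    if not points:
        raise ValueError("no detection results supplied")

    full_fraction, full_count = points[-1]
    if full_count <= 0:
        raise ValueError("full dataset detected no transcripts")

    # Minimum number of transcripts a partial dataset must detect.
    target = (1.0 - tolerance) * full_count

    # Walk from the smallest subset upward and stop at the first one
    # that reaches the target.
    for fraction, count in points:
        if count >= target:
            return fraction
    return full_fraction


# Hypothetical counts, for illustration only.
example = [(0.25, 600), (0.50, 900), (0.75, 960), (1.00, 1000)]
cutoff = minimal_fraction(example)
```

With these made-up counts the function selects the 0.75 subset, since it is the first to reach the 950-transcript target. A stricter or looser `tolerance` would move the cutoff accordingly.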