PVT: an efficient computational procedure to speed up next-generation sequence analysis.

Authors

Maji Ranjan Kumar, Sarkar Arijita, Khatua Sunirmal, Dasgupta Subhasis, Ghosh Zhumur

Affiliation

Bioinformatics Centre, Bose Institute, Kolkata 700054, India.

Publication

BMC Bioinformatics. 2014 Jun 4;15:167. doi: 10.1186/1471-2105-15-167.

Abstract

BACKGROUND

High-throughput Next-Generation Sequencing (NGS) techniques are advancing genomics and molecular biology research. The technology generates very large volumes of data, which challenges scientists to find efficient, cost-effective and time-effective ways to analyse them. Moreover, the different types of NGS data share certain challenging analysis steps. Spliced alignment is one such fundamental step, and it is extremely computationally intensive and time consuming. Even the most widely used spliced alignment tools have serious problems. TopHat is one such widely used tool: although it supports multithreading, it does not use computational resources efficiently in terms of CPU utilization and memory. Here we introduce PVT (Pipelined Version of TopHat), which takes a modular approach by breaking TopHat's serial execution into a pipeline of multiple stages, thereby increasing the degree of parallelization and computational resource utilization. We thus address these inefficiencies in TopHat so that large NGS data can be analysed efficiently.

RESULTS

We analysed the SRA datasets SRX026839 and SRX026838, which consist of single-end reads, and the SRA dataset SRR1027730, which consists of paired-end reads. We used TopHat v2.0.8 to analyse these datasets and recorded the CPU usage, memory footprint and execution time during spliced alignment. With this baseline information, we designed PVT, a pipelined version of TopHat that removes the redundant computational steps in spliced alignment and breaks the job into a pipeline of multiple stages (each comprising one or more steps) to improve resource utilization and thus reduce the execution time.
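The idea of splitting a serial job into a pipeline of stages can be sketched generically as follows. This is a minimal illustration under stated assumptions, not PVT's actual implementation: the stage functions `prepare`, `align` and `report` are hypothetical stand-ins and do not correspond to TopHat's real steps. Each stage runs in its own thread and hands results to the next stage through a queue, so different chunks of input can be at different stages at the same time.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of the input stream


def run_stage(func, inbox, outbox):
    """Apply func to every item from inbox and forward the result to outbox."""
    while True:
        item = inbox.get()
        if item is _DONE:
            outbox.put(_DONE)  # propagate shutdown to the next stage
            return
        outbox.put(func(item))


def run_pipeline(stages, items):
    """Run items through the stages, one thread per stage, preserving order."""
    queues = [queue.Queue() for _ in range(len(stages) + 1)]
    threads = [
        threading.Thread(target=run_stage, args=(func, queues[i], queues[i + 1]))
        for i, func in enumerate(stages)
    ]
    for t in threads:
        t.start()
    for item in items:
        queues[0].put(item)
    queues[0].put(_DONE)

    results = []
    while True:
        out = queues[-1].get()
        if out is _DONE:
            break
        results.append(out)
    for t in threads:
        t.join()
    return results


# Hypothetical stage functions, used only to exercise the pipeline.
def prepare(chunk):
    return chunk.upper()


def align(chunk):
    return (chunk, len(chunk))


def report(hit):
    return f"{hit[0]}:{hit[1]}"


if __name__ == "__main__":
    print(run_pipeline([prepare, align, report], ["acgt", "ttga"]))
```

In Python, threads mainly help when stages wait on I/O or external programs. For CPU-bound stages, a process-based variant would likely be needed to see a real speed-up.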

CONCLUSIONS

PVT improves on TopHat for spliced alignment in NGS data analysis. For the single-end read dataset, PVT reduced the execution time to ~23%. The version of PVT designed for paired-end reads improved execution time by ~41% over TopHat (for the chosen data). We also propose PVT-Cloud, which implements the PVT pipeline in a cloud computing system.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f295/4063226/88bb76473d33/1471-2105-15-167-1.jpg
