Hunan Provincial Key Lab of Bioinformatics, School of Computer Science and Engineering at Central South University, Changsha, China.
computer science at Old Dominion University, USA.
Brief Bioinform. 2021 Sep 2;22(5). doi: 10.1093/bib/bbab070.
The rapid increase in genome data brought by gene sequencing technologies poses a massive challenge to data processing. To solve the problems caused by enormous data volumes and complex computing requirements, researchers have proposed many methods and tools, which can be divided into three types: big data storage, efficient algorithm design and parallel computing. The purpose of this review is to investigate popular parallel programming technologies for genome sequence processing. Three common parallel computing models are introduced according to their hardware architectures; each is classified into two or three types and further analyzed in terms of its features. Parallel computing for genome sequence processing is then discussed through four common applications: genome sequence alignment, single nucleotide polymorphism calling, genome sequence preprocessing, and pattern detection and searching. For each application, its background is first introduced, and then a list of tools or algorithms is summarized in terms of principle, hardware platform and computing efficiency. The programming model of each hardware platform and application provides a reference for researchers choosing high-performance computing tools. Finally, we discuss the limitations and future trends of parallel computing technologies.