Chen Jinxiang, Li Fuyi, Wang Miao, Li Junlong, Marquez-Lago Tatiana T, Leier André, Revote Jerico, Li Shuqin, Liu Quanzhong, Song Jiangning
Department of Software Engineering, College of Information Engineering, Northwest A&F University, Yangling, China.
Department of Biochemistry and Molecular Biology, Biomedicine Discovery Institute, Monash University, Melbourne, VIC, Australia.
Front Big Data. 2022 Jan 18;4:727216. doi: 10.3389/fdata.2021.727216. eCollection 2021.
Simple sequence repeats (SSRs) are short tandem repeats of nucleotide sequences. SSRs have been shown to be associated with human diseases and are therefore of medical relevance. Accordingly, a variety of computational methods have been proposed to mine SSRs from genomes. Conventional methods rely on a high-quality complete genome to identify SSRs. However, sequenced genomes often miss several highly repetitive regions, and many non-model species lack a complete genome altogether. With recent advances in next-generation sequencing (NGS), large-scale sequence reads can be generated rapidly for any species. In this context, a number of methods have been proposed to identify thousands of SSR loci directly from large collections of reads for non-model species. Because the most commonly used NGS platforms (e.g., Illumina) generally produce short paired-end reads, merging overlapping paired-end reads has become a common preprocessing step before SSR loci are identified. Merging short read pairs and identifying SSRs in large-scale data therefore poses a big data analysis challenge for traditional stand-alone tools.
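To make the read-merging step concrete, the following is a minimal sketch of how two overlapping paired-end reads might be combined into one longer sequence. It is not the FLASH algorithm: the function names, the `min_overlap` and `max_mismatch_ratio` parameters, and the "take the longest acceptable overlap" rule are all assumptions chosen for illustration, and base quality scores are ignored.

```python
from typing import Optional

# Translation table for complementing DNA bases (N stays N).
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def merge_pair(
    read1: str,
    read2: str,
    min_overlap: int = 10,            # hypothetical default
    max_mismatch_ratio: float = 0.1,  # hypothetical default
) -> Optional[str]:
    """Merge a read pair if the end of read1 overlaps the start of the
    reverse-complemented read2; return None when no overlap qualifies."""
    mate = reverse_complement(read2)
    longest = min(len(read1), len(mate))
    # Try the longest candidate overlap first and shrink towards min_overlap.
    for overlap in range(longest, min_overlap - 1, -1):
        tail = read1[-overlap:]
        head = mate[:overlap]
        mismatches = sum(a != b for a, b in zip(tail, head))
        if mismatches <= max_mismatch_ratio * overlap:
            # Keep read1 in full and append the non-overlapping part of the mate.
            return read1 + mate[overlap:]
    return None


# Usage: simulate a 30 bp fragment sequenced as two 20 bp reads.
fragment = "ACGTTGCAAGGCTTAACCGGTTAGCATCGA"
forward_read = fragment[:20]
reverse_read = reverse_complement(fragment[10:])
merged = merge_pair(forward_read, reverse_read)
```

In this toy case the two reads share a 10 bp overlap, so the merged sequence reconstructs the original fragment. A production tool would also weigh base qualities and resolve mismatches inside the overlap.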
In this study, we present a new Hadoop-based software program, termed BigFiRSt, that addresses this problem using big data technology. BigFiRSt consists of two major modules, BigFLASH and BigPERF, which are built on two state-of-the-art stand-alone tools, FLASH and PERF, respectively. BigFLASH merges short read pairs and BigPERF mines SSRs, each in a distributed, big data fashion. Comprehensive benchmarking experiments show that BigFiRSt can dramatically reduce the execution time of read-pair merging and SSR mining on very large-scale DNA sequence data.
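As an illustration of what mining SSRs from a single sequence involves, here is a minimal sketch of a tandem-repeat scan. It is not PERF's implementation: the function name `find_ssrs`, the `max_motif` and `min_length` thresholds, and the tie-breaking rule (prefer the longest repeat tract, then the shortest motif) are assumptions made for this example.

```python
from typing import List, Tuple


def find_ssrs(
    seq: str,
    max_motif: int = 6,    # hypothetical: longest motif size considered
    min_length: int = 12,  # hypothetical: minimum repeat tract length in bases
) -> List[Tuple[str, int, int]]:
    """Return (motif, start, copies) for each perfect tandem repeat found.

    At every position the scan tries motif sizes 1..max_motif, extends each
    motif over whole copies only, and keeps the longest tract. Ties go to
    the shortest motif because sizes are tried in ascending order.
    """
    seq = seq.upper()
    n = len(seq)
    hits: List[Tuple[str, int, int]] = []
    i = 0
    while i < n:
        best = None  # (motif, end) of the best tract starting at i
        for k in range(1, max_motif + 1):
            if i + k > n:
                break
            motif = seq[i:i + k]
            if not set(motif) <= set("ACGT"):
                break  # ambiguous base: longer motifs would contain it too
            j = i + k
            while seq[j:j + k] == motif:
                j += k
            tract = j - i
            if tract >= min_length and tract >= 2 * k:
                if best is None or tract > best[1] - i:
                    best = (motif, j)
        if best is None:
            i += 1
        else:
            motif, end = best
            hits.append((motif, i, (end - i) // len(motif)))
            i = end  # resume after the tract so hits do not overlap
    return hits


# Usage: one dinucleotide and one trinucleotide repeat in unique flanks.
example = "GGTT" + "AC" * 8 + "TTGA" + "AGG" * 5 + "CT"
ssrs = find_ssrs(example)
```

The scan reports only perfect repeats made of whole motif copies. Real SSR miners typically also normalise motifs to a canonical rotation and strand, which this sketch omits.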
The performance of BigFiRSt is mainly attributable to Hadoop big data technology, which allows read pairs to be merged and SSRs to be mined in parallel across a distributed computing cluster. We anticipate that BigFiRSt will be a valuable tool in the coming era of biological big data.
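The parallel pattern described here can be pictured as a map, shuffle, and reduce pipeline. The sketch below emulates that pattern in a single Python process, assuming a MapReduce-style job. It does not use the Hadoop API and is not BigFiRSt's code: the function names, the regular expression (motifs of 2 to 6 bases repeated at least four times), and the choice to count SSRs per motif are all assumptions made for illustration.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

# Hypothetical SSR definition: a 2-6 base motif followed by 3 or more copies.
# The lazy quantifier makes the regex try the shortest motif first.
SSR_PATTERN = re.compile(r"([ACGT]{2,6}?)\1{3,}")


def map_read(read: str) -> Iterator[Tuple[str, int]]:
    """Map step: emit (motif, 1) for every SSR found in one read."""
    for match in SSR_PATTERN.finditer(read.upper()):
        yield match.group(1), 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle step: group emitted values by key, as a framework would."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce step: sum the counts collected for one motif."""
    return key, sum(values)


def run_job(reads: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle, and reduce sequentially in one process."""
    mapped = (pair for read in reads for pair in map_read(read))
    return dict(reduce_counts(k, v) for k, v in shuffle(mapped).items())


# Usage: four toy reads, three of which contain an SSR.
reads = [
    "TTACACACACGG",
    "GGACACACACACTT",
    "CTAGGAGGAGGAGGCT",
    "ACGTTGCA",
]
motif_counts = run_job(reads)
```

Because `map_read` looks at one read at a time and shares no state, the map step can run on many reads at once. On a cluster, the framework would distribute the reads across nodes and perform the shuffle over the network, which is where the reduction in execution time comes from.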