Department of Computer Science, COMSATS University Islamabad, Attock Campus 43600, Pakistan.
Genes (Basel). 2020 Feb 5;11(2):166. doi: 10.3390/genes11020166.
Next-generation sequencing (NGS) technologies produce huge amounts of biological data, which demand long processing times and large amounts of memory. This research focuses on detecting single nucleotide polymorphisms (SNPs) in genome sequences. Current SNP detection algorithms face several issues, e.g., computational overhead, accuracy, and memory requirements. We propose a fast and scalable workflow that integrates the Bowtie aligner with the Hadoop-based Heap SNP caller to improve SNP detection in genome sequences. The proposed workflow is validated on benchmark datasets obtained from publicly available web portals, e.g., NCBI and DDBJ DRA. Extensive experiments were performed: in the alignment phase the results are compared with the Bowtie and BWA aligners, and in the SNP calling phase with GATK, FaSD, SparkGA, Halvade, and Heap. The experimental results show that the proposed workflow outperforms existing frameworks, e.g., GATK, FaSD, and Heap integrated with the BWA and Bowtie aligners, as well as SparkGA and Halvade. On average, the proposed framework achieved a 22.46% higher F-score and a consistent accuracy of 99.80%; its mean accuracy is also 0.21% higher than that of the compared frameworks. Moreover, SNP mining was performed to identify specific regions in genome sequences. All frameworks were run with their default memory-management configuration, and all workflows showed approximately the same memory requirements. In the future, we intend to display the mined SNPs graphically for user-friendly interaction and to analyze and optimize the memory requirements.
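The abstract reports F-score as a headline metric but does not state how it was computed. As a minimal sketch, assuming the standard definition (harmonic mean of precision and recall) and assuming each SNP call is identified by a hypothetical `(chromosome, position, alternate allele)` tuple, a call set could be scored against a truth set as follows. The function name `snp_call_metrics` and the toy data are illustrative and are not taken from the paper.

```python
from typing import Dict, Set, Tuple

# Illustrative SNP key: (chromosome, 1-based position, alternate allele).
Snp = Tuple[str, int, str]


def snp_call_metrics(called: Set[Snp], truth: Set[Snp]) -> Dict[str, float]:
    """Score a set of called SNPs against a truth set.

    Assumes the standard definitions:
      precision = TP / (TP + FP)
      recall    = TP / (TP + FN)
      F-score   = 2 * precision * recall / (precision + recall)
    """
    tp = len(called & truth)   # called and present in the truth set
    fp = len(called - truth)   # called but absent from the truth set
    fn = len(truth - called)   # present in the truth set but not called

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f_score = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f_score": f_score}


# Toy example (made-up calls, not data from the study).
called = {("chr1", 100, "A"), ("chr1", 200, "T"), ("chr2", 50, "G")}
truth = {("chr1", 100, "A"), ("chr1", 200, "T"), ("chr2", 75, "C"), ("chr2", 90, "G")}
metrics = snp_call_metrics(called, truth)
print(metrics)
```

Accuracy is left out of this sketch because it also needs the count of true negatives (sites correctly left uncalled), which depends on how many genome positions were assessed.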