Oujja Anas, Abid Mohamed Riduan, Boumhidi Jaouad, Bourhnane Safae, Mourhir Asmaa, Merchant Fatima, Benhaddou Driss
School of Science and Engineering, Al Akhawayn University in Ifrane, Ifrane 53000, Morocco.
Computer Science, Signals, Automation and Cognitivism Laboratory (LISAC), Computer Science Department, Faculty of Science Dhar El Mahraz, Sidi Mohamed Ben Abdellah University, Fez 30000, Morocco.
Genomics Inform. 2021 Dec;19(4):e49. doi: 10.5808/gi.21056. Epub 2021 Dec 31.
Genomic data is one of the fastest growing datasets in the world. By 2025, it is expected to become the fourth largest source of Big Data, which calls for adequate high-performance computing (HPC) platforms for processing. Given the recent unprecedented and unpredictable mutations in severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), the research community urgently needs ICT tools to process SARS-CoV-2 RNA data, e.g., by classifying it (i.e., clustering), and thus to assist in tracking virus mutations and predicting future ones. In this paper, we present an HPC-based SARS-CoV-2 RNA clustering tool. We adopt a data science approach, from data collection, through analysis, to visualization. In the analysis step, we show how our clustering approach leverages HPC and the longest common subsequence (LCS) algorithm. The approach uses the Hadoop MapReduce programming paradigm and adapts the LCS algorithm to efficiently compute the length of the LCS for each pair of SARS-CoV-2 RNA sequences. The sequences are extracted from the U.S. National Center for Biotechnology Information (NCBI) Virus repository. The computed LCS lengths are used to measure the dissimilarities between RNA sequences in order to identify existing clusters. In addition, we present a comparative study of the LCS algorithm's performance under variable workloads and different numbers of Hadoop worker nodes.
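To make the pairwise step concrete, the following is a minimal single-machine sketch in Python of computing an LCS length and turning it into a dissimilarity score. It is not the paper's Hadoop MapReduce implementation. The function names (`lcs_length`, `dissimilarity`, `pairwise_dissimilarities`) are hypothetical, and the normalization `1 - LCS / max(len)` is an assumption chosen for illustration, since the abstract does not state the exact dissimilarity formula.

```python
from itertools import combinations


def lcs_length(a, b):
    """Length of the longest common subsequence of a and b (two-row DP)."""
    if len(a) < len(b):
        a, b = b, a  # keep the DP rows as short as the shorter sequence
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def dissimilarity(a, b):
    """Assumed normalization: 0.0 for identical sequences, up to 1.0."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0
    return 1.0 - lcs_length(a, b) / longest


def pairwise_dissimilarities(sequences):
    """Map each unordered pair of sequence IDs to its dissimilarity.

    `sequences` is a dict of {sequence_id: RNA string}.
    """
    return {
        (i, j): dissimilarity(sequences[i], sequences[j])
        for i, j in combinations(sorted(sequences), 2)
    }


if __name__ == "__main__":
    # Toy strings for illustration only, not real SARS-CoV-2 data.
    toy = {"s1": "AUGGCU", "s2": "AUGCU", "s3": "CCCGGG"}
    for pair, score in pairwise_dissimilarities(toy).items():
        print(pair, round(score, 3))
```

Each pair is independent of the others, which is what makes this computation a natural fit for distribution across worker nodes. The resulting scores could then feed a distance-based clustering method.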