Sahlabadi Amirhossein, Chandren Muniyandi Ravie, Sahlabadi Mahdi, Golshanbafghy Hossein
Faculty of Information Science and Technology, Universiti Kebangsaan Malaysia, 43600 Bangi, Malaysia.
Faculty of Creative Multimedia, Multimedia University, 63100 Cyberjaya, Selangor, Malaysia.
Adv Bioinformatics. 2018 Mar 29;2018:9391635. doi: 10.1155/2018/9391635. eCollection 2018.
Microarray technology has become one of the popular ways to study gene expression and diagnose disease. The National Center for Biotechnology Information (NCBI) hosts public databases containing large volumes of biological data that must be preprocessed, since they carry high levels of noise and bias. Robust Multiarray Average (RMA) is one of the standard and popular methods used to preprocess the data and remove the noise. Most preprocessing algorithms are time-consuming and cannot handle large datasets with thousands of experiments. Parallel processing can be used to address these issues. Hadoop is a well-known framework for distributed storage and processing that provides a parallel environment in which to run the experiment. In this research, for the first time, the capability of Hadoop and the statistical power of R have been leveraged to parallelize the existing RMA preprocessing algorithm so that microarray data can be processed efficiently. The experiment was run on a cluster of 5 nodes, each with 16 cores and 16 GB of memory. It compares the efficiency and performance of RMA parallelized with Hadoop against RMA parallelized with the affyPara package and against sequential RMA. The results show that the proposed approach outperforms both the sequential approach and the affyPara approach in speed-up.
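As a rough illustration of why RMA lends itself to parallel processing, the sketch below expresses quantile normalization, one of the standard RMA steps, as map and reduce phases in plain Python. It is not the authors' Hadoop and R implementation, which the abstract does not describe. All function names are hypothetical, the input is a toy list of intensity lists rather than real CEL data, and ties are broken by position rather than averaged. The `speed_up` helper uses the conventional definition (sequential time divided by parallel time), which is assumed here to be what "speed-up" means in the abstract.

```python
from statistics import mean


def sort_array(intensities):
    """Map phase: sort one array's intensities; each array is independent."""
    return sorted(intensities)


def rank_means(sorted_arrays):
    """Reduce phase: average the k-th smallest value across all arrays."""
    return [mean(values) for values in zip(*sorted_arrays)]


def apply_reference(intensities, reference):
    """Map phase: replace each value by the reference value of its rank.

    Ties are broken by original position (a simplification for this sketch).
    """
    order = sorted(range(len(intensities)), key=lambda i: intensities[i])
    normalized = [0.0] * len(intensities)
    for rank, index in enumerate(order):
        normalized[index] = reference[rank]
    return normalized


def quantile_normalize(arrays):
    """Run the three phases in sequence; all arrays must have equal length."""
    reference = rank_means([sort_array(a) for a in arrays])
    return [apply_reference(a, reference) for a in arrays]


def speed_up(sequential_seconds, parallel_seconds):
    """Conventional speed-up: sequential run time over parallel run time."""
    return sequential_seconds / parallel_seconds


if __name__ == "__main__":
    toy_arrays = [[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]]  # made-up intensities
    print(quantile_normalize(toy_arrays))
```

In this sketch the two map phases touch a single array at a time, so in principle they could be distributed across worker nodes, while only the rank averaging needs values from every array.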