A hybrid multi-objective whale optimization algorithm for analyzing microarray data based on Apache Spark.

Authors

AbdelAziz Amr Mohamed, Soliman Taysir, Ghany Kareem Kamal A, Sewisy Adel

Affiliations

Faculty of Computers and Artificial Intelligence, Beni-Suef University, Egypt.

Faculty of Computers and Information, Assiut University, Egypt.

Publication information

PeerJ Comput Sci. 2021 Mar 25;7:e416. doi: 10.7717/peerj-cs.416. eCollection 2021.

DOI:10.7717/peerj-cs.416

PMID:33834101

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC8022636/

Abstract

A microarray is a revolutionary tool that generates vast volumes of data that describe the expression profiles of genes under investigation that can be qualified as Big Data. Hadoop and Spark are efficient frameworks, developed to store and analyze Big Data. Analyzing microarray data helps researchers to identify correlated genes. Clustering has been successfully applied to analyze microarray data by grouping genes with similar expression profiles into clusters. The complex nature of microarray data obligated clustering methods to employ multiple evaluation functions to ensure obtaining solutions with high quality. This transformed the clustering problem into a Multi-Objective Problem (MOP). A new and efficient hybrid Multi-Objective Whale Optimization Algorithm with Tabu Search (MOWOATS) was proposed to solve MOPs. In this article, MOWOATS is proposed to analyze massive microarray datasets. Three evaluation functions have been developed to ensure an effective assessment of solutions. MOWOATS has been adapted to run in parallel using Spark over Hadoop computing clusters. The quality of the generated solutions was evaluated based on different indices, such as Silhouette and Davies-Bouldin indices. The obtained clusters were very similar to the original classes. Regarding the scalability, the running time was inversely proportional to the number of computing nodes.
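The abstract names the Davies-Bouldin index as one of the measures used to judge cluster quality. As a minimal sketch of what that index computes, and not the authors' implementation, the following plain-Python function scores a labelled set of points; lower values indicate more compact, better-separated clusters. The function name `davies_bouldin`, the toy expression profiles, and the two labelings are assumptions made purely for illustration, and nothing here runs on Spark.

```python
import math
from collections import defaultdict


def davies_bouldin(points, labels):
    """Return the Davies-Bouldin index of a clustering (lower is better).

    points: list of equal-length numeric tuples (e.g. gene expression profiles)
    labels: list of cluster labels, one per point
    """
    # Group the points by their cluster label.
    groups = defaultdict(list)
    for point, label in zip(points, labels):
        groups[label].append(point)
    if len(groups) < 2:
        raise ValueError("the index needs at least two clusters")

    # Centroid of each cluster: the coordinate-wise mean of its members.
    centroids = {
        label: tuple(sum(coord) / len(members) for coord in zip(*members))
        for label, members in groups.items()
    }
    # Scatter of each cluster: mean Euclidean distance of members to the centroid.
    scatter = {
        label: sum(math.dist(p, centroids[label]) for p in members) / len(members)
        for label, members in groups.items()
    }

    # For each cluster, take its worst (largest) similarity ratio to any other
    # cluster, then average those worst cases over all clusters.
    # Note: this sketch assumes no two centroids coincide (no zero distance).
    worst_ratios = []
    for i in groups:
        worst_ratios.append(
            max(
                (scatter[i] + scatter[j]) / math.dist(centroids[i], centroids[j])
                for j in groups
                if j != i
            )
        )
    return sum(worst_ratios) / len(worst_ratios)


# Hypothetical toy data: four "genes" measured under two conditions.
profiles = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
good_labels = [0, 0, 1, 1]  # groups genes with similar profiles together
bad_labels = [0, 1, 0, 1]   # mixes dissimilar genes in each cluster

good_score = davies_bouldin(profiles, good_labels)
bad_score = davies_bouldin(profiles, bad_labels)
```

On this toy data the labeling that groups similar profiles scores lower than the mixed labeling, which is the behaviour a clustering search would reward when the index is used as a quality measure.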
