
Big data clustering techniques based on Spark: a literature review.

Authors

Saeed Mozamel M, Al Aghbari Zaher, Alsharidah Mohammed

Affiliations

Department of Computer Science, Prince Sattam Bin Abdul Aziz, Riyadh, Saudi Arabia.

Department of Computer Science, University of Sharjah, Sharjah, United Arab Emirates.

Publication information

PeerJ Comput Sci. 2020 Nov 30;6:e321. doi: 10.7717/peerj-cs.321. eCollection 2020.

DOI:10.7717/peerj-cs.321
PMID:33816971
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC7924475/
Abstract

Clustering, a popular unsupervised learning method, is used extensively in data mining, machine learning and pattern recognition. The procedure groups individual data points into clusters such that points in the same cluster are similar to each other and dissimilar to points in other clusters. Traditional clustering methods are greatly challenged by the recent massive growth of data. Several studies have therefore proposed novel designs for clustering methods that leverage the benefits of Big Data platforms such as Apache Spark, which is designed for fast, distributed processing of massive data. However, Spark-based clustering research is still in its early days. In this systematic survey, we investigate the existing Spark-based clustering methods in terms of their support for the characteristics of Big Data. Moreover, we propose a new taxonomy for Spark-based clustering methods. To the best of our knowledge, no survey has been conducted on Spark-based clustering of Big Data. This survey therefore aims to present a comprehensive summary of studies on Big Data clustering using Apache Spark published between 2010 and 2020. It also highlights new research directions in the field of clustering massive data.
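As a rough illustration of the idea summarized in the abstract, the sketch below shows how a clustering method such as k-means can be split into a per-partition step and a merge step, which is the general shape that distributed platforms exploit. It is plain Python, not Spark code, and it is not taken from the surveyed paper; the function names (`assign_partition`, `merge_partials`, `kmeans`), the toy data and the choice of k-means itself are assumptions made only for this example.

```python
from collections import defaultdict
from math import dist


def assign_partition(points, centroids):
    """Map-like step: for one partition, accumulate (sum vector, count) per nearest centroid."""
    dims = len(centroids[0])
    partial = defaultdict(lambda: ([0.0] * dims, 0))
    for p in points:
        # Index of the closest centroid by Euclidean distance.
        idx = min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
        sums, count = partial[idx]
        partial[idx] = ([s + x for s, x in zip(sums, p)], count + 1)
    return dict(partial)


def merge_partials(partials, centroids):
    """Reduce-like step: combine the partial sums from all partitions into new centroids."""
    totals = {}
    for partial in partials:
        for idx, (sums, count) in partial.items():
            if idx in totals:
                t_sums, t_count = totals[idx]
                totals[idx] = ([a + b for a, b in zip(t_sums, sums)], t_count + count)
            else:
                totals[idx] = (sums, count)
    # A centroid that received no points keeps its previous position.
    return [
        tuple(s / totals[i][1] for s in totals[i][0]) if i in totals else tuple(centroids[i])
        for i in range(len(centroids))
    ]


def kmeans(partitions, centroids, iterations=10):
    """Run a fixed number of k-means iterations over data that is pre-split into partitions."""
    for _ in range(iterations):
        partials = [assign_partition(part, centroids) for part in partitions]
        centroids = merge_partials(partials, centroids)
    return centroids


# Hypothetical toy data: two partitions standing in for data held on two workers.
partitions = [
    [(1.0, 1.0), (1.0, 2.0), (9.0, 9.0)],
    [(2.0, 1.0), (8.0, 9.0), (9.0, 8.0)],
]
centroids = kmeans(partitions, [(0.0, 0.0), (10.0, 10.0)])
print(centroids)
```

Only the small per-centroid sums and counts cross partition boundaries here, never the raw points, which is one reason this family of algorithms is commonly adapted to distributed settings.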


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/789a/7924475/29c3a1d2e135/peerj-cs-06-321-g003.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/789a/7924475/99252386b15a/peerj-cs-06-321-g001.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/789a/7924475/2a5abab1b65c/peerj-cs-06-321-g002.jpg

Similar articles

1
Big data clustering techniques based on Spark: a literature review.
PeerJ Comput Sci. 2020 Nov 30;6:e321. doi: 10.7717/peerj-cs.321. eCollection 2020.
2
Performance Evaluation of Data-driven Intelligent Algorithms for Big data Ecosystem.
Wirel Pers Commun. 2022;126(3):2403-2423. doi: 10.1007/s11277-021-09362-7. Epub 2022 Aug 23.
3
A distributed computing model for big data anonymization in the networks.
PLoS One. 2023 Apr 28;18(4):e0285212. doi: 10.1371/journal.pone.0285212. eCollection 2023.
4
Moth-Flame Optimization-Bat Optimization: Map-Reduce Framework for Big Data Clustering Using the Moth-Flame Bat Optimization and Sparse Fuzzy C-Means.
Big Data. 2020 Jun;8(3):203-217. doi: 10.1089/big.2019.0125. Epub 2020 May 19.
5
Big Data in metagenomics: Apache Spark vs MPI.
PLoS One. 2020 Oct 6;15(10):e0239741. doi: 10.1371/journal.pone.0239741. eCollection 2020.
6
Social big data: Recent achievements and new challenges.
Inf Fusion. 2016 Mar;28:45-59. doi: 10.1016/j.inffus.2015.08.005. Epub 2015 Aug 28.
7
A Parallel Multiobjective PSO Weighted Average Clustering Algorithm Based on Apache Spark.
Entropy (Basel). 2023 Jan 31;25(2):259. doi: 10.3390/e25020259.
8
A hybrid multi-objective whale optimization algorithm for analyzing microarray data based on Apache Spark.
PeerJ Comput Sci. 2021 Mar 25;7:e416. doi: 10.7717/peerj-cs.416. eCollection 2021.
9
Efficient processing of complex XSD using Hive and Spark.
PeerJ Comput Sci. 2021 Aug 17;7:e652. doi: 10.7717/peerj-cs.652. eCollection 2021.
10
Framing Apache Spark in life sciences.
Heliyon. 2023 Feb 9;9(2):e13368. doi: 10.1016/j.heliyon.2023.e13368. eCollection 2023 Feb.

Cited by

1
The effect of big data technologies usage on social competence.
PeerJ Comput Sci. 2023 Nov 17;9:e1691. doi: 10.7717/peerj-cs.1691. eCollection 2023.
2
Design of feature selection algorithm for high-dimensional network data based on supervised discriminant projection.
PeerJ Comput Sci. 2023 Jun 26;9:e1447. doi: 10.7717/peerj-cs.1447. eCollection 2023.
3
Standardized Classification of Cerebral Vasospasm after Subarachnoid Hemorrhage by Digital Subtraction Angiography.
J Clin Med. 2022 Apr 3;11(7):2011. doi: 10.3390/jcm11072011.
4
Review of deep learning: concepts, CNN architectures, challenges, applications, future directions.
J Big Data. 2021;8(1):53. doi: 10.1186/s40537-021-00444-8. Epub 2021 Mar 31.

References

1
CASS: A distributed network clustering algorithm based on structure similarity for large-scale network.
PLoS One. 2018 Oct 10;13(10):e0203670. doi: 10.1371/journal.pone.0203670. eCollection 2018.