Performance Evaluation of Data-driven Intelligent Algorithms for Big data Ecosystem.

Authors

Junaid Muhammad, Ali Sajid, Siddiqui Isma Farah, Nam Choonsung, Qureshi Nawab Muhammad Faseeh, Kim Jaehyoun, Shin Dong Ryeol

Affiliations

Department of Electrical and Computer Engineering, Sungkyunkwan University, Suwon, South Korea.

Department of Computer Science and Engineering, Sungkyunkwan University, Suwon, South Korea.

Publication information

Wirel Pers Commun. 2022;126(3):2403-2423. doi: 10.1007/s11277-021-09362-7. Epub 2022 Aug 23.

DOI:10.1007/s11277-021-09362-7
PMID:36033548
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC9396610/
Abstract

Artificial intelligence, and machine learning in particular, has been applied by researchers in a variety of ways to transform diverse data sources into valuable facts and understanding, enabling superior pattern identification. Running machine learning algorithms on huge and complicated data sets is computationally expensive, however: processing requires hardware and logical resources such as space, CPU, and memory. As the amount of data created daily reaches quintillions of bytes, a complex big data infrastructure becomes more and more relevant. The Apache Spark machine learning library (ML-lib) is a well-known platform for big data analysis; it includes several useful features for machine learning applications, including regression, classification, dimension reduction, clustering, and feature extraction. In this contribution, we consider Apache Spark ML-lib as a computationally independent machine learning library that is open-source, distributed, scalable, and platform-independent. We evaluated and compared several ML algorithms to analyze the platform's qualities, and compared Apache Spark ML-lib against Rapid Miner and Sklearn, two other big data and machine learning processing platforms. Four machine learning algorithms are compared across platforms: Logistic Classifier (LC), Decision Tree Classifier (DTc), Random Forest Classifier (RFC), and Gradient Boosted Tree Classifier (GBTC). In addition, we tested general regression methods, namely Linear Regressor (LR), Decision Tree Regressor (DTR), Random Forest Regressor (RFR), and Gradient Boosted Tree Regressor (GBTR), on the SUSY and HIGGS datasets. Moreover, we evaluated unsupervised learning methods such as K-means and Gaussian Mixture Models on the SUSY and HEPMASS datasets to determine the robustness of PySpark in comparison with the classification and regression models. We used the "SUSY," "HIGGS," "BANK," and "HEPMASS" datasets from the UCI data repository. We also discuss recent developments in the research into Big Data machines and provide future research directions.
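The kind of cross-algorithm comparison the abstract describes can be illustrated with a minimal sketch on one of the named platforms, scikit-learn. Everything specific here is an assumption for illustration and is not taken from the paper: the `benchmark_classifiers` helper is hypothetical, the data is synthetic (`make_classification`) and stands in for the UCI datasets, the hyperparameters are arbitrary, and accuracy plus fit time are chosen as example metrics.

```python
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def benchmark_classifiers(n_samples=2000, n_features=18, seed=0):
    """Fit each classifier once and return {name: (accuracy, fit_seconds)}."""
    # Synthetic stand-in data (assumption): not SUSY, HIGGS, BANK, or HEPMASS.
    X, y = make_classification(
        n_samples=n_samples,
        n_features=n_features,
        n_informative=8,
        random_state=seed,
    )
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=seed
    )

    # The four classifier families named in the abstract; settings are illustrative.
    models = {
        "LC": LogisticRegression(max_iter=1000),
        "DTc": DecisionTreeClassifier(max_depth=5, random_state=seed),
        "RFC": RandomForestClassifier(n_estimators=50, random_state=seed),
        "GBTC": GradientBoostingClassifier(n_estimators=50, random_state=seed),
    }

    results = {}
    for name, model in models.items():
        start = time.perf_counter()
        model.fit(X_train, y_train)  # time only the training step
        fit_seconds = time.perf_counter() - start
        accuracy = accuracy_score(y_test, model.predict(X_test))
        results[name] = (accuracy, fit_seconds)
    return results


if __name__ == "__main__":
    for name, (accuracy, fit_seconds) in benchmark_classifiers().items():
        print(f"{name}: accuracy={accuracy:.3f}, fit_time={fit_seconds:.3f}s")
```

Running the same loop with a second library's estimators on identical train and test splits would give a like-for-like platform comparison. The timings from a sketch like this depend heavily on hardware and data size, so they say nothing about the results reported in the paper.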

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/d258/9396610/7d40b686df48/11277_2021_9362_Fig1_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/d258/9396610/34b16e8d6951/11277_2021_9362_Fig2_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/d258/9396610/aa19155dd0de/11277_2021_9362_Fig3_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/d258/9396610/f7b9a45c1308/11277_2021_9362_Fig4_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/d258/9396610/f83c561baac2/11277_2021_9362_Fig5_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/d258/9396610/a1519dd5d183/11277_2021_9362_Fig6_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/d258/9396610/379dc958a4d0/11277_2021_9362_Fig7_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/d258/9396610/3fec455db5a7/11277_2021_9362_Fig8_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/d258/9396610/2ff735b219e9/11277_2021_9362_Fig9_HTML.jpg
