Junaid Muhammad, Ali Sajid, Siddiqui Isma Farah, Nam Choonsung, Qureshi Nawab Muhammad Faseeh, Kim Jaehyoun, Shin Dong Ryeol
Department of Electrical and Computer Engineering, Sungkyunkwan University, Suwon, South Korea.
Department of Computer Science and Engineering, Sungkyunkwan University, Suwon, South Korea.
Wirel Pers Commun. 2022;126(3):2403-2423. doi: 10.1007/s11277-021-09362-7. Epub 2022 Aug 23.
Artificial intelligence, and machine learning in particular, has been applied by the research community in a variety of ways to transform diverse data sources into valuable facts and insight, enabling superior pattern identification. Running machine learning algorithms on huge and complicated data sets is, however, computationally expensive: processing requires hardware and logical resources such as storage space, CPU, and memory. As the amount of data created daily reaches quintillions of bytes, a sophisticated big data infrastructure becomes increasingly relevant. The Apache Spark machine learning library (ML-lib) is a well-known platform for big data analysis. It includes several useful features for machine learning applications, including regression, classification, dimensionality reduction, clustering, and feature extraction. In this contribution, we consider Apache Spark ML-lib as a computationally independent machine learning library that is open-source, distributed, scalable, and platform-independent. To analyze the platform's qualities, we evaluated several machine learning algorithms and compared Apache Spark ML-lib against RapidMiner and scikit-learn (Sklearn), two other big data and machine learning processing platforms. Four classification algorithms are compared across platforms: Logistic Classifier (LC), Decision Tree Classifier (DTC), Random Forest Classifier (RFC), and Gradient Boosted Tree Classifier (GBTC). In addition, we tested general regression methods, namely Linear Regressor (LR), Decision Tree Regressor (DTR), Random Forest Regressor (RFR), and Gradient Boosted Tree Regressor (GBTR), on the SUSY and HIGGS datasets. We also evaluated unsupervised learning methods, K-means and Gaussian Mixture Models, on the SUSY and HEPMASS datasets to determine the robustness of PySpark in comparison with the classification and regression models. We used the "SUSY," "HIGGS," "BANK," and "HEPMASS" datasets from the UCI data repository. Finally, we discuss recent developments in research on machine learning for big data and provide future research directions.