Gao Yufei, Zhou Yanjie, Zhou Bing, Shi Lei, Zhang Jiacai
College of Information Science and Technology, Beijing Normal University, Beijing, China
Department of Industrial Engineering, Pusan National University, Pusan, Republic of Korea
J Healthc Eng. 2017;2017. doi: 10.1155/2017/1425102.
The healthcare industry generates large amounts of data, and analyzing these data has emerged as an important problem in recent years. The MapReduce programming model has been successfully used for big data analytics. However, data skew invariably occurs in big data analytics and seriously degrades efficiency. To overcome the data skew problem in MapReduce, we previously proposed a data processing algorithm called Partition Tuning-based Skew Handling (PTSH). In contrast to the one-stage partitioning strategy used in the traditional MapReduce model, PTSH uses a two-stage strategy with a partition tuning method: it disperses key-value pairs across virtual partitions and recombines the partitions when data skew occurs. The robustness and efficiency of the proposed algorithm were tested on a wide variety of simulated datasets and real healthcare datasets. The results showed that the PTSH algorithm handles data skew in MapReduce efficiently and improves the performance of MapReduce jobs in comparison with native Hadoop, Closer, and locality-aware and fairness-aware key partitioning (LEEN). We also found that adopting the PTSH algorithm significantly reduces the time needed for rule extraction, because it is better suited to association rule mining (ARM) on healthcare data.
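The two-stage idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' PTSH implementation: the abstract does not specify the tuning rule, so the greedy largest-first recombination used here is an assumption, and the names `stable_hash`, `one_stage_loads`, and `two_stage_assign` are hypothetical. Stage 1 hashes keys into more virtual partitions than there are reducers; stage 2 recombines the virtual partitions onto reducers so that the loads are as even as the greedy rule allows.

```python
import heapq
import zlib
from collections import Counter


def stable_hash(key: str) -> int:
    """Deterministic hash (Python's built-in hash() is salted per process)."""
    return zlib.crc32(key.encode("utf-8"))


def one_stage_loads(keys, num_reducers):
    """Baseline: plain hash partitioning, one partition per reducer."""
    loads = [0] * num_reducers
    for key in keys:
        loads[stable_hash(key) % num_reducers] += 1
    return loads


def two_stage_assign(keys, num_reducers, num_virtual):
    """Stage 1: count records per virtual partition.
    Stage 2 (assumed rule): place virtual partitions, largest first,
    on whichever reducer currently has the smallest load."""
    virtual_sizes = Counter(stable_hash(key) % num_virtual for key in keys)

    # Min-heap of (current load, reducer id).
    heap = [(0, reducer) for reducer in range(num_reducers)]
    heapq.heapify(heap)

    assignment = {}  # virtual partition id -> reducer id
    largest_first = sorted(virtual_sizes.items(), key=lambda kv: (-kv[1], kv[0]))
    for vp, size in largest_first:
        load, reducer = heapq.heappop(heap)
        assignment[vp] = reducer
        heapq.heappush(heap, (load + size, reducer))

    loads = [0] * num_reducers
    for vp, size in virtual_sizes.items():
        loads[assignment[vp]] += size
    return assignment, loads


if __name__ == "__main__":
    # Synthetic skewed key set: two hot keys plus a tail of rare keys.
    keys = ["a"] * 50 + ["b"] * 30 + [f"k{i}" for i in range(20)]
    print("one-stage loads:", one_stage_loads(keys, 4))
    print("two-stage loads:", two_stage_assign(keys, 4, 16)[1])
```

Because every record with a given key falls into the same virtual partition, this sketch cannot split a single hot key across reducers; it only helps when skew comes from several keys colliding in the same partition.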