College of Information Science and Technology, Beijing Normal University, Beijing, China.
Department of Industrial Engineering, Pusan National University, Pusan, Republic of Korea.
J Healthc Eng. 2017;2017:1425102. doi: 10.1155/2017/1425102. Epub 2017 Mar 29.
The healthcare industry generates large amounts of data, and analyzing these data has emerged as an important problem in recent years. The MapReduce programming model has been used successfully for big data analytics. However, data skew invariably occurs in big data analytics and seriously affects efficiency. To overcome the data skew problem in MapReduce, we previously proposed a data processing algorithm called Partition Tuning-based Skew Handling (PTSH). In contrast to the one-stage partitioning strategy used in the traditional MapReduce model, PTSH uses a two-stage strategy and a partition tuning method that disperses key-value pairs into virtual partitions and recombines the partitions in case of data skew. The robustness and efficiency of the proposed algorithm were tested on a wide variety of simulated datasets and real healthcare datasets. The results showed that the PTSH algorithm handles data skew in MapReduce efficiently and improves the performance of MapReduce jobs compared with native Hadoop, Closer, and locality-aware and fairness-aware key partitioning (LEEN). We also found that adopting the PTSH algorithm significantly reduces the time needed for rule extraction, because it is better suited to association rule mining (ARM) on healthcare data.
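The two-stage idea described above can be illustrated with a minimal Python sketch. This is not the authors' PTSH implementation: the function names `virtual_partition` and `tune_partitions`, the CRC32 hash, and the greedy "largest virtual partition to least-loaded reducer" rule are all assumptions chosen for illustration. Stage one spreads keys over many virtual partitions; stage two recombines those virtual partitions into one group per reducer so that the groups are roughly balanced.

```python
import heapq
import zlib
from collections import Counter


def virtual_partition(key, num_virtual):
    """Stage 1 (assumed): map a key to one of many virtual partitions."""
    # CRC32 is used because it is stable across runs, unlike Python's hash().
    return zlib.crc32(key.encode("utf-8")) % num_virtual


def tune_partitions(keys, num_virtual, num_reducers):
    """Stage 2 (assumed): recombine virtual partitions into reducer groups.

    Returns (assignment, reducer_loads), where assignment maps each
    non-empty virtual partition id to a reducer id.
    """
    # Count how many key-value pairs fall into each virtual partition.
    loads = Counter(virtual_partition(k, num_virtual) for k in keys)

    # Min-heap of (current load, reducer id): the least-loaded reducer is on top.
    heap = [(0, r) for r in range(num_reducers)]
    heapq.heapify(heap)

    assignment = {}
    # Greedy heuristic: place the heaviest virtual partitions first.
    for vp, load in sorted(loads.items(), key=lambda kv: (-kv[1], kv[0])):
        total, reducer = heapq.heappop(heap)
        assignment[vp] = reducer
        heapq.heappush(heap, (total + load, reducer))

    reducer_loads = [0] * num_reducers
    for vp, load in loads.items():
        reducer_loads[assignment[vp]] += load
    return assignment, reducer_loads


if __name__ == "__main__":
    # Hypothetical skewed input: one diagnosis code dominates the records.
    records = ["I10"] * 50 + ["E11"] * 20 + [f"code{i}" for i in range(30)]
    assignment, reducer_loads = tune_partitions(records, num_virtual=32, num_reducers=4)
    print(reducer_loads)
```

In this sketch all pairs that share a key still land in the same virtual partition, so a single dominant key cannot be split; the balancing comes only from how the virtual partitions are regrouped.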