Aristodimou Aristos, Diavastos Andreas, Pattichis Constantinos S
Department of Computer Science, University of Cyprus, Nicosia, Cyprus.
School of Computing, National University of Singapore, Singapore, Republic of Singapore.
Health Informatics J. 2022 Jan-Mar;28(1):14604582211065397. doi: 10.1177/14604582211065397.
Discretization is a preprocessing technique that converts continuous features into categorical ones. This step is essential for algorithms that cannot handle continuous data as input. In the big data era, it is also important that a discretizer processes data efficiently. In this paper, a new supervised density-based discretization (DBAD) algorithm is proposed that satisfies these requirements. The algorithm was evaluated on 11 datasets covering a wide range of dataset types in the medical domain. It was tested against three state-of-the-art discretizers using three classifiers with different characteristics. A parallel version of the algorithm was evaluated using two synthetic big datasets. In the majority of the tests, the algorithm performed statistically similarly to or better than the three discretization algorithms it was compared with. Additionally, the algorithm was faster than the other discretizers in all of the tests. Finally, the parallel version of DBAD shows almost linear speedup for a Message Passing Interface (MPI) implementation (9.64× for 10 nodes), while a hybrid MPI/OpenMP implementation achieves a 35.3× speedup for 10 nodes and 6 threads per node.
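To put the reported speedups in context, the short sketch below computes parallel efficiency, that is, speedup divided by the number of workers. It is an illustration only and not part of the paper: the function name `parallel_efficiency` is hypothetical, the speedups are assumed to be measured against a sequential run, and the hybrid configuration is assumed to use 10 × 6 = 60 workers in total.

```python
def parallel_efficiency(speedup: float, workers: int) -> float:
    """Return speedup divided by worker count (1.0 is ideal linear scaling)."""
    if workers <= 0:
        raise ValueError("workers must be positive")
    return speedup / workers


# Speedup figures as reported in the abstract.
mpi_eff = parallel_efficiency(9.64, 10)         # 10 MPI nodes
# Assumption: 10 nodes x 6 threads per node = 60 workers in total.
hybrid_eff = parallel_efficiency(35.3, 10 * 6)

print(f"MPI efficiency:    {mpi_eff:.1%}")
print(f"Hybrid efficiency: {hybrid_eff:.1%}")
```

Under these assumptions the MPI version runs at roughly 96% efficiency, which matches the "almost linear" description, while the hybrid version trades lower per-worker efficiency (roughly 59%) for a much larger absolute speedup.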