Salvador-Meneses Jaime, Ruiz-Chavez Zoila, Garcia-Rodriguez Jose
Facultad de Ingeniería, Ciencias Físicas y Matemática, Universidad Central del Ecuador, Quito 170129, Ecuador.
Computer Technology Department, University of Alicante, 03080 Alicante, Spain.
Entropy (Basel). 2019 Feb 28;21(3):234. doi: 10.3390/e21030234.
The kNN (k-nearest neighbors) classification algorithm is one of the most widely used non-parametric classification methods; however, its memory consumption grows with the size of the dataset, which makes it impractical for large volumes of data. Variations of this method have been proposed, such as condensed kNN, which divides the training dataset into clusters to be classified; other variations reduce the input dataset before applying the algorithm. This paper presents a variation of the kNN algorithm, of the structure-less kNN type, designed to work with categorical data. Categorical data, due to their nature, can be compressed to decrease the memory required at classification time. The method adds a compression phase before the algorithm is applied to the compressed data. This allows the whole dataset to be kept in memory, which leads to a considerable reduction in the amount of memory required. Experiments and tests carried out on well-known datasets show the reduction in the volume of information stored in memory while the classification accuracy is maintained. They also show a slight decrease in processing time, because the information is decompressed in real time (on-the-fly) while the algorithm is running.
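The idea described in the abstract can be illustrated with a minimal sketch: categorical attributes are encoded as small integer codes, each training row is bit-packed into a single integer (the compression phase), and kNN then unpacks each row on-the-fly while computing a Hamming-style distance over the categorical attributes. This is an illustrative assumption about the general approach, not the paper's exact compression scheme; all function names below (`pack`, `unpack`, `knn_predict`) are hypothetical.

```python
from collections import Counter

def pack(rows, bits_per_attr):
    """Compress each row of categorical codes into one integer,
    using bits_per_attr[i] bits for attribute i (illustrative scheme)."""
    packed = []
    for row in rows:
        word = 0
        for bits, code in zip(bits_per_attr, row):
            word = (word << bits) | code
        packed.append(word)
    return packed

def unpack(word, bits_per_attr):
    """Decompress one packed row back into its list of categorical codes."""
    vals = []
    for bits in reversed(bits_per_attr):
        vals.append(word & ((1 << bits) - 1))
        word >>= bits
    return list(reversed(vals))

def knn_predict(packed_train, labels, query, bits_per_attr, k=3):
    """Classify `query` by majority vote among its k nearest neighbors,
    decompressing each stored row on-the-fly (Hamming distance on categories)."""
    dists = []
    for word, label in zip(packed_train, labels):
        row = unpack(word, bits_per_attr)
        d = sum(a != b for a, b in zip(row, query))  # mismatch count
        dists.append((d, label))
    dists.sort(key=lambda t: t[0])
    return Counter(label for _, label in dists[:k]).most_common(1)[0][0]

# Toy usage: 3 categorical attributes needing 2, 2, and 1 bits respectively.
bits = [2, 2, 1]
train = [[0, 1, 0], [0, 1, 1], [3, 2, 0], [3, 3, 1]]
labels = ["a", "a", "b", "b"]
compressed = pack(train, bits)      # whole training set held as small integers
print(knn_predict(compressed, labels, [0, 1, 0], bits, k=3))  # prints "a"
```

Because every row collapses to one machine word, the entire training set stays in memory in compressed form, matching the memory-reduction argument of the abstract.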