Zhang Yiqun, Cheung Yiu-Ming, Tan Kay Chen
IEEE Trans Neural Netw Learn Syst. 2020 Jan;31(1):39-52. doi: 10.1109/TNNLS.2019.2899381. Epub 2019 Mar 19.
Ordinal data are common in many data mining and machine learning tasks. Unlike nominal data, the possible values (also called categories) of an ordinal attribute are naturally ordered. Nevertheless, since the values are not quantitative, the distance between two categories of an ordinal attribute is generally not well defined, and an inappropriate distance metric can seriously distort the results of quantitative analysis. In practice, categorical data with a mixture of nominal and ordinal attributes (ordinal-and-nominal-attribute categorical data) are common, but distance metrics for such data have yet to be well explored in the literature. In this paper, within the framework of clustering analysis, we first propose an entropy-based distance metric for ordinal attributes, which exploits the underlying order information among the categories of an ordinal attribute for distance measurement. We then generalize this metric into a unified one that is applicable to ordinal-and-nominal-attribute categorical data. Compared with existing metrics proposed for categorical data, the proposed metric is simple to use and nonparametric. More importantly, it exploits both the underlying order information of ordinal attributes and the statistical information of nominal attributes for distance measurement. Extensive experiments show that the proposed metric outperforms its existing counterparts on both real and benchmark data sets.
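To make the idea of an entropy-based distance for ordinal and nominal attributes more concrete, the following is a minimal Python sketch. The abstract does not give the formula, so everything here is an assumption made for illustration: the distance between two ordinal categories is taken to be the sum of the entropy terms -p log p of all categories lying between them in the order (endpoints included), and the distance between two nominal categories is taken to be the sum of the entropy terms of those two categories only. The function names `entropy_terms`, `ordinal_distance`, and `nominal_distance` are hypothetical and do not come from the paper.

```python
import math
from collections import Counter


def entropy_terms(values, categories):
    """Return the entropy term -p*log(p) for each category.

    p is the empirical frequency of the category in `values`.
    Categories that never occur contribute 0.
    """
    counts = Counter(values)
    n = len(values)
    terms = {}
    for c in categories:
        p = counts.get(c, 0) / n
        terms[c] = -p * math.log(p) if p > 0 else 0.0
    return terms


def ordinal_distance(a, b, ordered_categories, terms):
    """Assumed ordinal distance: accumulate the entropy terms of every
    category from a to b along the given order, endpoints included."""
    i, j = ordered_categories.index(a), ordered_categories.index(b)
    if i == j:
        return 0.0
    lo, hi = min(i, j), max(i, j)
    return sum(terms[c] for c in ordered_categories[lo:hi + 1])


def nominal_distance(a, b, terms):
    """Assumed nominal distance: only the two categories themselves
    contribute, since no order relates them to other categories."""
    if a == b:
        return 0.0
    return terms[a] + terms[b]


# Toy ordinal attribute with order low < mid < high.
order = ["low", "mid", "high"]
observed = ["low", "low", "mid", "high"]
terms = entropy_terms(observed, order)

d_low_mid = ordinal_distance("low", "mid", order, terms)
d_low_high = ordinal_distance("low", "high", order, terms)
d_nominal = nominal_distance("low", "high", terms)
```

Under these assumptions, categories that are further apart in the order are never closer than adjacent ones, because every intermediate category adds a non-negative term. Treating the same attribute as nominal drops the intermediate terms, which is one way a single entropy-based formulation could cover both attribute types.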