IEEE Trans Cybern. 2022 Feb;52(2):758-771. doi: 10.1109/TCYB.2020.2983073. Epub 2022 Feb 16.
An ordinal attribute shares all the common characteristics of a nominal attribute, but differs in that its possible values (also called categories) are naturally ordered. In clustering analysis, categorical data composed of both ordinal and nominal attributes (also called mixed-categorical data) are common. For such data, existing distance and similarity measures suffer from at least one of the following two drawbacks: 1) they treat ordinal attributes as nominal ones and thus ignore their order information, and 2) they assume that all attributes are independent of each other, measuring the distance between two categories of a target attribute without considering the valuable information provided by the other attributes that correlate with it. These two drawbacks may distort the natural distances of attributes and, in turn, lead to unsatisfactory clustering results. This article therefore presents an entropy-based distance metric that quantifies the distance between categories by exploiting the information provided by the attributes that correlate with the target one. It also preserves the order relationship among ordinal categories during distance measurement. Since attributes are usually correlated to different degrees, we also define the interdependence between different types of attributes to weight their contributions in forming distances. The proposed metric overcomes the two above-mentioned drawbacks for mixed-categorical data clustering. More importantly, it conceptually unifies the distances of ordinal and nominal attributes to avoid information loss during clustering. Moreover, it is parameter-free and incurs no extra computational cost compared with existing state-of-the-art counterparts. Extensive experiments demonstrate the superiority of the proposed distance metric.
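To make the two ideas above concrete, the following is a minimal illustrative sketch and not the metric defined in the article, whose exact formulas are not given in this abstract. It assumes the Jensen-Shannon divergence as a stand-in entropy-based measure between the conditional distributions of one correlated (context) attribute, and it assumes that the distance between two ordinal categories can be accumulated over the adjacent categories lying between them so that order is preserved. The interdependence weighting across several attributes is omitted. All function names, attribute names, and the toy data are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(p):
    """Shannon entropy (in bits) of a distribution given as {value: probability}."""
    return -sum(v * log2(v) for v in p.values() if v > 0)


def conditional(rows, target, context, category):
    """Empirical distribution of `context` among rows where `target` == `category`.

    Assumes the category occurs at least once in `rows`.
    """
    counts = Counter(r[context] for r in rows if r[target] == category)
    total = sum(counts.values())
    return {k: c / total for k, c in counts.items()}


def js_divergence(p, q):
    """Jensen-Shannon divergence (assumed stand-in, not the article's measure)."""
    keys = set(p) | set(q)
    m = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}
    # Clamp at zero to absorb tiny negative floating-point residue.
    return max(0.0, entropy(m) - 0.5 * (entropy(p) + entropy(q)))


def category_distance(rows, target, context, a, b, order=None):
    """Distance between categories `a` and `b` of the `target` attribute.

    Nominal target (order is None): compare the two conditional distributions directly.
    Ordinal target (order is a list): sum the divergences of adjacent categories
    between `a` and `b`, so farther-apart categories are never closer.
    """
    def cond(c):
        return conditional(rows, target, context, c)

    if order is None:
        return js_divergence(cond(a), cond(b))
    i, j = sorted((order.index(a), order.index(b)))
    return sum(js_divergence(cond(order[k]), cond(order[k + 1])) for k in range(i, j))


# Hypothetical toy data: "grade" is ordinal, "dept" is a correlated nominal attribute.
rows = [
    {"grade": "low", "dept": "A"},
    {"grade": "low", "dept": "A"},
    {"grade": "medium", "dept": "A"},
    {"grade": "medium", "dept": "B"},
    {"grade": "high", "dept": "B"},
    {"grade": "high", "dept": "B"},
]
ORDER = ["low", "medium", "high"]

d_low_medium = category_distance(rows, "grade", "dept", "low", "medium", ORDER)
d_low_high = category_distance(rows, "grade", "dept", "low", "high", ORDER)
print(d_low_medium, d_low_high)
```

In this sketch a plain match/mismatch (Hamming-style) distance would rate "low" versus "medium" and "low" versus "high" as equally far apart, whereas the accumulated distance separates them and also draws on the information carried by the correlated attribute.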